Stochastic Modeling of Spatial Distributed Sequences



Field of the Invention 

The present invention relates to stochastic modeling of sequences and 
more particularly but not exclusively to modeling of data sequences 
representing spatial distributions using stochastic techniques, and again but not 
exclusively for using the resultant models for analysis of the sequence. 

Background of the Invention 

Data sequences often contain redundancy, context dependency and state 
dependency. Often the relationships within the data are complex, non-linear 
and unknown, and the application of existing control and processing algorithms 
to such data sequences does not generally lead to useful results.

Statistical Process Control (SPC) essentially began with the Shewhart 
chart and since then extensive research has been performed to adapt the chart to 
various industrial settings. Early SPC methods were based on two critical 
assumptions: 

i) there exists a priori knowledge of the underlying distribution (often,
observations are assumed to be normally distributed); and

ii) the observations are independent and identically distributed (i.i.d.).

In practice, the above assumptions are frequently violated in many 
industrial processes. 

Current SPC methods can be categorized into groups using two different
criteria as follows:

1) methods for independent data where observations are not 
interrelated versus methods for dependent data; 

2) methods that are model-specific, requiring a priori assumptions
on the process characteristics and its underlying distribution, and methods that
are model-generic. The latter methods try to estimate the underlying model
with minimum a priori assumptions.

Figure 1 is a chart of relationships between different SPC methods and 
includes the following: 

Information Theoretic Process Control (ITPC) is an independent-data
based and model-generic SPC method proposed by Alwan, Ebrahimi and Soofi
(1998). It utilizes information theory principles, such as maximum entropy,
subject to constraints derived from dynamics of the process. It provides a
theoretical justification for the traditional Gaussian assumption and suggests a
unified control chart, as opposed to traditional SPC methods that require
separate charts for each moment.

Traditional SPC methods, such as Shewhart, Cumulative Sum
(CUSUM) and Exponentially Weighted Moving Average (EWMA) are for
independent data and are model-specific. It is important to note that these
traditional SPC methods are extensively implemented in industry. The
independence assumptions on which they rely are frequently violated in
practice, especially since automated testing devices increase the sampling
frequency and introduce autocorrelation into the data. Moreover, 
implementation of feedback control devices at the shop floor level tends to 
create structured dynamics in certain system variables. Applying traditional 
SPC to such interrelated processes increases the frequency of false alarms and 
shortens the 'in-control' average run length (ARL) in comparison to 
uncorrelated observations. As shown later in this section, these methods can be 
modified to control autocorrelated data. 

The majority of model-specific methods for dependent data are time-series
based. The underlying principle of such model-dependent methods is as
follows: assuming a time series model family can best capture the
autocorrelation process, it is possible to use that model to filter the data, and
then apply traditional SPC schemes to the stream of residuals. In particular, the
ARIMA (Auto Regressive Integrated Moving Average) family of models is
widely applied for the estimation and filtering of process autocorrelation.
Under certain assumptions, the residuals of the ARIMA model are independent
and approximately normally distributed, to which traditional SPC can be
applied. Furthermore, it is commonly believed that ARIMA models, mostly
the simple ones such as AR(1), can effectively describe a wide variety of
industry processes.

Model-specific methods for autocorrelated data can be further
partitioned into parameter-dependent methods that require explicit estimation
of the model parameters, and parameter-free methods, where the model
parameters are only implicitly derived, if at all.

Several parameter-dependent methods have been proposed over the
years for autocorrelated data. Alwan and Roberts (1988) proposed the Special
Cause Chart (SCC) in which the Shewhart method is applied to the stream of
residuals. They showed that the SCC has major advantages over Shewhart with
respect to mean shifts. The SCC deficiency lies in the need to explicitly
estimate all the ARIMA parameters. Moreover, the method performs poorly for
a large positive autocorrelation, since the mean shift tends to stabilize rather
quickly to a steady state value, and the shift is poorly manifested on the
residuals (see Wardell, Moskowitz and Plante (1994) and Harris and Ross
(1991)).

Runger, Willemain and Prabhu (1995) implemented traditional SPC for
autocorrelated data using CUSUM methods. Lu and Reynolds (1997, 1999)
extended the method by using the EWMA method with a small difference:
their model had a random error added to the ARIMA model. The drawback of
these models is the need for explicit parameter estimation and
estimation of their process-dependence features. It was demonstrated in Runger
and Willemain (1995) that for certain autocorrelated processes, the use of
traditional SPC yields an improved performance in comparison to ARIMA-based
methods.

The Generalized Likelihood Ratio Test (GLRT) method proposed by
Apley and Shi (1999) takes advantage of the residuals' transient dynamics in the
ARIMA model, when a mean shift is introduced. The generalized likelihood
ratio may be applied to the filtered residuals. The method may be compared to
the Shewhart, CUSUM and EWMA methods for autocorrelated data, inferring
that the choice of the adequate time-series based SPC method depends strongly
on characteristics of the specific process being controlled. Moreover, in Apley
and Shi (1999) and in Runger and Willemain (1995) it is emphasized in
conclusion that modeling errors of ARIMA parameters have strong impacts on
the performance (e.g., the ARL) of parameter-dependent SPC methods for
autocorrelated data. If the process can be accurately defined by an ARIMA
time series, the parameter-dependent SPC methods are superior in
comparison to non-parametric methods since they allow efficient statistical
analysis. If such a definition is not possible, then the effort of estimating the 
time series parameters becomes impractical. Such a conclusion, amongst other 
reasons, triggered the development of parameter-free methods to avoid the 
impractical estimation of time-series parameters. 

A parameter-free model was proposed by Montgomery and
Mastrangelo (1991) as an approximation procedure based on EWMA. They
suggested using the EWMA statistic as a one step ahead prediction value for
the IMA(1,1) model. Their underlying assumption was that even if the process 
is better described by another member of the ARIMA family, the IMA(1,1) 
model is a good enough approximation. Zhang (1998), however, compared 
several SPC methods and showed that Montgomery's approximation performed 
poorly. He proposed employing the EWMA statistic for stationary processes, 
but adjusted the process variance according to the autocorrelation effects. 
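
The use of the EWMA statistic as a one step ahead predictor can be sketched as follows. This is only an illustrative sketch: the function name, the smoothing constant `lam` and the initial value `z0` are assumptions introduced here and are not taken from Montgomery and Mastrangelo (1991).

```python
def ewma_one_step_errors(xs, lam=0.2, z0=None):
    """Use the EWMA statistic as a one step ahead forecast.

    Returns (predictions, errors), where predictions[t] is the forecast of
    xs[t] made from the observations before it, and errors[t] is the
    corresponding prediction error. `lam` and `z0` are illustrative choices.
    """
    z = xs[0] if z0 is None else z0  # assumed starting value of the EWMA
    predictions, errors = [], []
    for x in xs:
        predictions.append(z)         # forecast made before seeing x
        errors.append(x - z)          # one step ahead prediction error
        z = lam * x + (1 - lam) * z   # EWMA update
    return predictions, errors


if __name__ == "__main__":
    preds, errs = ewma_one_step_errors([0, 10, 10], lam=0.5, z0=0)
    print(preds, errs)
```

Under this scheme the prediction errors, rather than the raw observations, would be the stream that is monitored by a control chart.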

Runger and Willemain (1995, 1996) discussed the weighted batch mean 
(WBM) and the unified batch mean (UBM) methods. The WBM method 
assigns weights for the observations mean and defines the batch size so that the 
autocorrelation among batches reduces to zero. In the UBM method the batch 
size is defined (with unified weights) so that the autocorrelation remains under 
a certain level. 
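
A minimal sketch of the UBM idea follows: observations are averaged in equal-weight batches, and the batch size is increased until the lag-1 autocorrelation of the batch means falls under a chosen level. The doubling search, the threshold `max_corr` and the limit `min_batches` are assumptions made for illustration and are not the procedure of Runger and Willemain.

```python
def batch_means(xs, b):
    """Equal-weight (unified) means of consecutive batches of size b."""
    n = len(xs) // b  # incomplete trailing batch is dropped
    return [sum(xs[i * b:(i + 1) * b]) / b for i in range(n)]


def lag1_autocorr(xs):
    """Sample lag-1 autocorrelation of a sequence."""
    n = len(xs)
    mean = sum(xs) / n
    den = sum((x - mean) ** 2 for x in xs)
    if den == 0:
        return 0.0
    num = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    return num / den


def choose_batch_size(xs, max_corr=0.1, min_batches=8):
    """Double the batch size until the batch means are nearly uncorrelated.

    Returns None if no batch size leaves at least `min_batches` batches.
    """
    b = 1
    while len(xs) // b >= min_batches:
        if abs(lag1_autocorr(batch_means(xs, b))) <= max_corr:
            return b
        b *= 2
    return None
```

The resulting batch means could then be plotted on a traditional chart in place of the individual observations.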

Runger and Willemain demonstrated that weights estimated from the
ARIMA model do not guarantee a performance improvement and that it is
beneficial to apply the simpler UBM method. In general, parameter-free
methods do not require explicit ARIMA modeling; however, they are all based
on the implicit assumption that the time-series model is adequate to describe
the process. While this can be true in some industrial environments, such an
approach cannot capture more complex and non-linear process dynamics that
depend on the state in which the system operates, for example processes that
are described by Hidden Markov Models (HMM) (see Elliot, Lakhdar Aggoun
and Moore (1995)).



The problem of Pattern classification 

In general, the goal of pattern recognition, according to Therrien
(1989), is to classify objects of interest into one of a number of categories or
classes. The objects of interest are called patterns, and they may be printed
letters or characters, biological cells, electronic wave-forms or signals,
"states" of a system or any number of other things that one may desire to
classify. If there exists some set of labeled patterns, namely patterns whose
classes are known, then one has a problem in supervised pattern recognition.
The basic procedure followed in design of a supervised pattern recognition
system involves a portion of a set of labeled patterns being extracted and used to
derive a classification algorithm. These patterns are called the training set.
The remaining patterns are then used to test the classification algorithm;
these patterns are collectively referred to as the test set. Since the correct
classes of the individual patterns in the test set are also known, the
performance of the algorithm can be evaluated. In supervised pattern
recognition problems, the results are preferably evaluated by a "teacher" or
"supervisor" whose output dictates suitable modifications to the algorithm,
hence the term supervised pattern recognition. Once a desired level of
performance is achieved (which is measured in terms of a misclassification
rate), the algorithm can be used on initially unlabeled patterns. At this point,
the feedback loop involving the teacher is formally broken.

WO 02/067075 PCT/IL02/00131

Nonetheless it is
usually advisable to have some spot-checking of results. Such checks can be 
accommodated either by providing an alternative classification algorithm or a 
human observer if possible. In some situations it may be feasible to wait a 
certain length of time until the correct classification is known. If the classes 
of all of the available patterns are unknown, and perhaps even the number of 
these classes is unknown, then one has a problem in unsupervised pattern 
recognition or clustering. In clustering problems, one attempts to find classes 
of patterns with similar properties where sometimes even these properties
may be undefined. The unsupervised pattern recognition or clustering
problem is a much more difficult one than the supervised pattern recognition
problem. Nevertheless, useful algorithms have been developed in this area
and success depends to a large extent on the ability to learn the structure of
pattern measurement data in high-dimensional spaces. The present disclosure 
focuses on a supervised pattern recognition scheme. 

The pattern recognition approach:
In the typical pattern recognition approach, observations first undergo feature
transformation and then classification in order to arrive at an output decision.
An observation vector x is first transformed, by the feature transformation,
into another vector y whose components are called features. The features are
intended to be fewer in number than the observations but should collectively
contain most of the information needed for classification of the patterns. By
reducing the observations to a smaller number of features, one hopes to 
design a decision rule that is more reliable. The feature vector y can be 
represented in a feature space Y similar to the way that observation vectors 
are represented in the observation space. The dimension of the feature space, 
however, is usually much lower than the dimension of the observation space. 
Procedures that analyze data in an attempt to define appropriate features are 
called feature extraction procedures. The feature vector y is passed to a 
classifier whose purpose is to make a decision about the pattern. The 
classifier essentially induces a partitioning of the feature space into a number 
of disjoint regions. If the feature vector corresponding to a pattern falls into 
region R_i, the pattern is assigned to class ω_i.

In general, the symbol x is used herein to represent observation 
vectors and y is used to represent feature vectors.

There are several ways to perform pattern recognition. We classify
the pattern recognition methods into different classes, as shown in the tree
depicted in attached Fig. A1, which is in compliance with Duda et al (2001).
We will detail those branches in the tree that are related to the present
disclosure.

The first classification is between supervised pattern recognition vs. 
unsupervised pattern recognition: 

In supervised pattern recognition, the types and the number of the
existing classes are known. In addition, the classes in the training set are 
tagged. 

By contrast, in unsupervised pattern recognition, the classes of all
of the available patterns are unknown, and in some cases even the
number of these classes is unknown. Consequently, in such situations the
classes in the training set are not tagged and the problem becomes a
clustering problem.

The present disclosure concerns problems of supervised pattern
recognition, since, as will be explained below in the description of the
specific embodiments, the construction algorithm may make use of the
different tagged classes in the training set to generate a different context-tree
model for each class. For example, in the promoter recognition
problem there are two tagged classes: "promoters" and "non-promoters".
We thus continue to detail the supervised pattern recognition branch.

The second classification distinguishes between statistical and 
logical methods. 

Logical Methods are usually used when the classification
problem involves nominal data, for instance descriptions, that are
discrete and without any natural notion of similarity or even ordering
(Duda et al (2001)). The decision tree is an example of a logical method.
This branch is irrelevant to the present disclosure.



Statistical Methods use statistical tools and they are based on
feature vectors of real-valued and discrete-valued numbers. There
can be a natural measure of distance between these vectors. In this
category, which is relevant to the present disclosure, we make
another distinction between Unknown probabilistic models and
Known probabilistic models.

In unknown probabilistic models, the underlying probabilistic model
is unknown. In many cases researchers make use of discriminant functions
to address these types of problems. Since we assume that a general context-tree
model can well represent the different classes (although the parameters
of the tree are unknown and need to be estimated from the training set), we
do not consider this branch of methods.

Known probabilistic models - the distribution function or a general
probabilistic model, such as a transition probabilistic tree, is assumed known.
We assume that a general context-tree model can well represent the
different classes. In this category, which is relevant to the present
disclosure, we distinguish between the following two types of models:

Known parameters - models based on known parameters. This is often
the easiest albeit the rarer problem. In this case, researchers typically use
Bayesian decision theory to classify the unknown object.

Unknown parameters - in these cases, researchers often estimate the
parameters by known methods such as maximum likelihood estimation
(where parameters are assumed to be fixed), Bayesian estimation (where
parameters are assumed to be random variables), and Gibbs sampling. To this
branch of methods the present disclosure belongs. This branch includes some
other state-dependent models such as Markov models, Hidden Markov
Models, neural nets, etc. Note that once the parameters of the model are
estimated then conventional methods of classification can be used such as
those based on Bayesian decision theory.

Given the above classification, note that Markov models are the closest
methods to the method suggested in the present disclosure. In the following, we
briefly sketch the Markov models.

Markov models 

Markov models are based on a finite memory assumption, i.e., that each
symbol depends only on its k predecessors, where k is fixed. The simplest model
is the first-order Markov model, which assumes that each symbol at time t depends
only on the symbol at time t-1:

$$P(x_t = W(t) \mid x_1 = W(1), x_2 = W(2), \ldots, x_{t-1} = W(t-1)) = P(x_t = W(t) \mid x_{t-1} = W(t-1)),$$

where W(t) denotes the state at time t (state i at time t being written W_i(t)).

In order to calculate the probability that the model generates a particular
sequence, the successive probabilities should simply be multiplied. 
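
The multiplication of successive probabilities can be sketched for a first-order model as follows. The two-symbol alphabet and the numerical values of `initial` and `transition` are hypothetical and serve only as an illustration.

```python
def sequence_probability(seq, initial, transition):
    """Probability that a first-order Markov model generates `seq`.

    initial[s]        -- assumed probability that the sequence starts with s
    transition[p][s]  -- assumed probability of symbol s given previous symbol p
    """
    if not seq:
        return 1.0
    prob = initial[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        prob *= transition[prev][cur]  # multiply the successive probabilities
    return prob


if __name__ == "__main__":
    # Hypothetical parameters over the alphabet {A, B}.
    initial = {"A": 0.5, "B": 0.5}
    transition = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.4, "B": 0.6}}
    print(sequence_probability("AAB", initial, transition))
```

For long sequences the product underflows quickly, so in practice one would usually sum logarithms instead.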

Markov models of higher order simply extend the size of the memory.
The suggested methods of the present disclosure can be viewed as a varying-order
Markov model, since the order of the memory doesn't have to be fixed, as
explained later.

In general, Markov Models assume that the states are accessible. In
many cases, however, the perceiver does not have access to the states.
Consequently, the Markov Model should be augmented to a Hidden Markov Model,
which is a Markov model with invisible states. Hidden Markov models have a
number of parameters whose values are set so as to best explain training patterns
for the known category.

An alternative model to the Markovian is the context-tree that was
suggested by Rissanen (1983) for data compression purposes and modified
later in Weinberger, Rissanen and Feder (1995) and in Ben-Gal et al (2000,
2001). The tree presentation of a finite-memory source is advantageous since
states are defined as contexts - graphically represented by branches in the
context-tree with variable length - and hence, requires less estimation effort
than that required for a Markov presentation. The context-tree is an
irreducible set of conditional probabilities of output symbols given their
contexts. The tree is conveniently estimated by the context algorithm. The
algorithm generates an asymptotically minimal tree fitting the data. The
attributes of the context-tree along with the ease of its estimation make it
suitable for a model-generic classifier, as explained later.
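
The idea of variable-length contexts can be illustrated with a small lookup sketch. The tree below is a hypothetical, hand-written set of contexts and conditional probabilities stored in a dictionary; it is not the output of the context algorithm, whose estimation and pruning steps are not shown here.

```python
import math

# Hypothetical context tree: each key is a context (most recent symbol last),
# each value is the conditional distribution of the next symbol.
TREE = {
    "":   {"a": 0.5, "b": 0.5},   # root, used when no longer context applies
    "a":  {"a": 0.8, "b": 0.2},
    "ba": {"a": 0.1, "b": 0.9},
}


def find_context(tree, history):
    """Return the longest suffix of `history` that is a context in `tree`."""
    longest = max(len(ctx) for ctx in tree)
    for k in range(min(len(history), longest), -1, -1):
        ctx = history[len(history) - k:]
        if ctx in tree:
            return ctx
    raise KeyError("tree has no root context")


def sequence_log_prob(tree, seq):
    """Log-probability of `seq`, conditioning each symbol on its context."""
    total = 0.0
    for i, symbol in enumerate(seq):
        ctx = find_context(tree, seq[:i])
        total += math.log(tree[ctx][symbol])
    return total
```

In this sketch the memory length differs from symbol to symbol: after "ba" two symbols of history are used, after "a" alone only one, and otherwise none.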



Pattern Classification in Biology
Applications of Pattern Recognition in Biology

The following is a summary of Prof. Shamir's lecture notes available at
http://www.math.tau.ac.il/~rshamir/ and from Higgins and Taylor (2000). The
sequences of the family members in Biology are often compared in order to
find properties that are shared by all members and understand how these could
explain certain biological properties. Discovering patterns in biology is
widespread, and has two main applications: i) classification: the patterns are
to be used for discriminating between family members and non-members; and
ii) finding patterns that describe biologically important features.

In the following we briefly describe some of the main applications of 
pattern recognition in Biology. Note that the present embodiments are
applicable to all these problems. 

Coding sequences in prokaryotic Gene Structure

There are more than 3 billion bases of human, albeit eukaryotic, DNA
sequences and complete DNA sequences for dozens of species available in
GenBank. Not all the sequences are coding, namely are a template for a
protein. In the human genome only 3%-5% of the sequences are coding
sequences, and the approximate proportion applies also to prokaryotic
sequences. Due to the size of the database, there is a need to find a way for
automatic finding of coding sequences. The algorithm should look for long
sequences of codons, without any stop codon. It should scan the DNA
sequence, looking for long ORFs (open reading frames) in all three reading
frames (the first codon can start from the first, second or the third base). After
detecting a stop codon, the algorithm scans backward, searching for a start
codon.
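
The scan described above can be sketched as follows. This is a simplified illustration under stated assumptions: only the given strand is scanned, ATG is taken as the only start codon, the standard stop codons TAA, TAG and TGA are used, and `min_codons` is a hypothetical length threshold.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(dna, min_codons=3):
    """Return (start, end) index pairs of ORFs in the three reading frames.

    For each stop codon, the scan goes backward (up to the previous stop
    codon in the same frame) and keeps the most upstream start codon.
    `end` is exclusive and includes the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        codons = [dna[i:i + 3] for i in range(frame, len(dna) - 2, 3)]
        last_stop = -1
        for idx, codon in enumerate(codons):
            if codon not in STOP_CODONS:
                continue
            start = None
            for j in range(idx - 1, last_stop, -1):  # backward scan
                if codons[j] == START_CODON:
                    start = j
            if start is not None and idx - start >= min_codons:
                orfs.append((frame + 3 * start, frame + 3 * idx + 3))
            last_stop = idx
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAATTTTAGCC"))
```

A fuller implementation would also scan the reverse complement strand, which this sketch omits.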

It should be noticed that the coding sequence has no fixed length, a fact
that creates difficulties for the pattern recognition algorithm. A typical
prokaryote sequence is shown in Fig. A2.



Exons in Eukaryotic Sequences

The gene structure and the gene expression mechanism in eukaryotes are
far more complicated than in prokaryotes. In typical eukaryotes, the region of
the DNA coding for a protein is usually not continuous. This region is
composed of alternating stretches of exons and introns. During transcription,
both exons and introns are transcribed onto the RNA, in their linear order.
Thereafter, a process called splicing takes place, in which the intron sequences
are excised and discarded from the RNA sequence. The remaining RNA
segments, the ones corresponding to the exons, are ligated to form the mature
RNA strand. A typical multi-exon gene has the following structure: It starts
with the promoter region, which is followed by a transcribed but non-coding
region called 5' untranslated region (5' UTR). Then follows the initial exon
which contains the start codon. Following the initial exon, there is an
alternating series of introns and internal exons, followed by the terminating
exon, which contains the stop codon. It is followed by another non-coding
region called the 3' UTR. Ending the eukaryotic gene, there is a
polyadenylation (polyA) signal: the nucleotide Adenine repeating several
times. The exon-intron boundaries (i.e., the splice sites) are signaled by specific
short (2bp long) sequences. The 5' (3') end of an intron (exon) is called the
donor site, and the 3' (5') end of an intron (exon) is called the acceptor site.
Fig. A3 represents a typical eukaryote sequence structure.

Exon length does not have a geometric distribution. The length seems to
have a functional role in the splicing itself. Typically, exons that are too short
(under 50bp) leave no room for the spliceosomes (enzymes that perform the
splicing) to operate and exons that are too long (above 300bp) are
difficult to locate. Thus another model for exon length is required.

An algorithm for this kind of recognition should notice the differences
between the exons and the introns, and the connections between them. It also
should take into consideration the different length distributions: an average
internal exon is about 150bp long, while introns of the order of 1Kbp length are
not uncommon.

Promoters

A promoter is a region of DNA to which RNA polymerase binds before
initiating the transcription of DNA into RNA.

Not all open reading frames are transcribed into genes. The transcription
depends on regulatory regions that control the transcription rate. In the
transcription process, an RNA polymerase binds tightly to the promoter. The
promoter is an 'anchor' point; it pinpoints where RNA transcription should
begin. At the stop signal the polymerase releases the RNA and detaches itself
from the DNA. We further distinguish between two cases:

i) Prokaryotic promoters: 

The promoter contains two remote sequences of six bases each. In one example
of our invention, which is described later, we identify E. coli promoters. In
E. coli one can find the following consensus sequence around the RNA
transcription start point:

nnnTTGACAnnnnnnnnnnnnnnnnnnTATAATnnnnnnNnnn. N is the transcription
start point. TTGACA appears 35 bases before N, and TATAAT (also known as
TATA box or Pribnow box) appears 12 bases before N. We have here 2 anchor
points for the polymerase. These sequences are short but the frequency of their
occurrence is high.

Since the consensus sequence mentioned above doesn't appear in each 
promoter, an algorithm should be developed in order to recognize a pattern that 
will suit all the promoters. 

ii) Eukaryotic promoters:

Much less is known about eukaryote promoters; each of the three RNA 
polymerases has a different promoter. 

RNA polymerase I recognizes a single promoter for the precursor of
rRNA.

RNA polymerase II, which transcribes all genes coding for polypeptides,
recognizes many thousands of promoters. Most have the Goldberg-Hogness or
TATA box that is centred around position -25 and has the consensus sequence
5'-TATAAAA-3'. Several promoters have a CAAT box around -90 with the
consensus sequence 5'-GGCCAATCT-3'. There is increasing evidence that all
promoters for genes for "housekeeping" proteins contain multiple copies of a
GC-rich element that includes the sequence 5'-GGGCGG-3'. Transcription by
polymerase II is also affected by more distant elements known as enhancers.

The promoter for RNA polymerase III is located within the gene either
as a single sequence, as in the 5S RNA gene, or as two blocks, as in all tRNA
genes.

Splice Sites 
The Splice Sites structures are: 

5' splice sites: MAG|GTRAGT where M is A or C and R is A or G 
3' splice sites: CAG|GT 

An algorithm might predict the splice sites that distinguish between the
exons and the introns. 
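
As a rough illustration, the two consensus structures given above can be turned into regular expressions and used to list candidate boundaries. This sketch matches the consensus exactly, which is an assumption made for simplicity; actual splice sites often deviate from the consensus, so exact matching is only a starting point. The function names are hypothetical.

```python
import re

# Only the symbols used in the consensus strings above are mapped.
SYMBOLS = {"A": "A", "C": "C", "G": "G", "T": "T", "M": "[AC]", "R": "[AG]"}


def compile_site(pattern):
    """Compile a consensus such as 'MAG|GTRAGT'; '|' marks the boundary."""
    left, right = pattern.split("|")
    body = "".join(SYMBOLS[c] for c in left + right)
    # A lookahead is used so that overlapping candidates are all reported.
    return re.compile("(?=" + body + ")"), len(left)


def find_sites(dna, pattern):
    """Return the boundary positions of all exact consensus matches."""
    regex, offset = compile_site(pattern)
    return [m.start() + offset for m in regex.finditer(dna.upper())]


if __name__ == "__main__":
    dna = "TTCAGGTAAGTCC"
    print(find_sites(dna, "MAG|GTRAGT"))  # 5' splice site consensus
    print(find_sites(dna, "CAG|GT"))      # 3' splice site consensus
```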



Terminators 

At the end of the coding sequences a signal exists (the terminator) which
means "stop making RNA here". It is composed of a sequence which is rich in
the bases G and C and which can form a hairpin loop. This structure is more
strongly hydrogen bonded (G-C base pairs are held together by three hydrogen
bonds) causing the RNA polymerase to slow down.

PolyA

PolyA is a polyadenylic acid sequence of varying length found at the 3' end of
most eukaryotic mRNAs. The poly-A tail is added post-transcriptionally to the
primary transcript as part of the nuclear processing of RNA, yielding hnRNAs
with 60-200 adenylate residues in the tail. In the cytoplasm the poly-A tail on
mRNAs is gradually reduced in length. The function of the poly-A tail is not
clear.

It is useful to find the pattern of the polyadenylic acid.

Proteins

Proteins are long chains of Amino Acids (AA). There are 20 types of
AA that serve as building blocks for proteins. Each AA has a specific chemical
structure. The length of a protein chain can range from 50 to 3000 AA (200 on
the average). One of the interesting properties of proteins is the unique folding.
The AA composition of a protein will usually uniquely determine (under specific
environment conditions) the 3D structure of the protein (e.g., two proteins with
the same AA sequence will have the same 3D structure in natural conditions).
Research on the 3D structure of proteins has shown that when a folded protein
is artificially stretched to a chain, it folds back to its original 3D structure.
Proteins are known to have many important functions in the cell, such as
enzymatic activity, storage and transport of material, signal transduction,
antibodies and more. All proteins whose structure is known are stored in the
Protein DataBank (PDB), which contains more than 10,000 proteins.
A protein has multiple levels of structure:
Primary structure - Chain of Amino Acids (1 dimensional).

Secondary structure - Chains of structural elements, most important of
which are α-helices and β-sheets.

Tertiary and Quaternary structure - 3D structure, of a single AA chain
or several chains, respectively.

The pattern recognition of proteins is done in these three levels: primary 
structure (20 symbols), secondary structure and 3D structure. 

Classification Concepts in Biology

Pattern-Based Classification Approaches
Brazma et al (1998) proposed a general three-step approach for
discovering patterns from protein and DNA sequences:

i) Choose a solution space (a set of patterns that the method should 
discover). 

ii) Define a fitness function reflecting how well a pattern fits the
input sequence.

iii) Develop an algorithm which gets a set of input sequences, and
returns the patterns with the highest fitness.

According to Jonassen (2000), there are two main groups of sequence patterns: 
Deterministic patterns (Regular expression type patterns) and Probabilistic 
patterns. Next, both types of patterns are described. 

Deterministic Patterns 

In deterministic pattern methods, a sequence either matches or does not 
match the pattern. 

A very simple type of pattern is a substring pattern: a sequence
matches a substring pattern if it contains the substring. When matching a
sequence against a substring pattern there are two kinds of matching: exact and
approximate.

When using approximate matching, a sequence matches the pattern if it
contains a substring approximately equal to the pattern. "Approximately equal"
means that only some of the characters are the same.

In the approximate matching, a distance is defined between a pattern and
a substring, and an upper limit on the distance (a threshold) is set.

Typical ways to measure the distance between two strings are: 

i) Hamming distance - counting the number of character changes 
needed to transform one into the other (number of mismatches). 
The measure is relevant only for two strings of the same size. 

ii) Gaps based methods. In these methods in order to match the two 
strings, it's allowed to insert or delete characters in addition to 
their substitutions (namely, gaps are allowed). 

For example: ACCDDECA versus ACDDECA

Without gaps:   ACCDDECA
                || |
                ACDDECA     (3 matches)

With gaps:      ACCDDECA
                | ||||||
                A_CDDECA    (7 matches, one gap of length 1)

In the above methods, the distance is a function of the number of
mismatches, the number of gaps and their length.
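
The two kinds of comparison can be sketched in a few lines. The gapped variant below simply maximizes the number of matching characters (a longest common subsequence computation) and ignores gap penalties, which is a simplifying assumption; a distance of the kind described above would additionally charge for the number and length of gaps.

```python
def hamming_distance(a, b):
    """Number of mismatching positions; defined only for equal lengths."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def ungapped_matches(a, b):
    """Matches when the strings are compared position by position."""
    return sum(x == y for x, y in zip(a, b))


def gapped_matches(a, b):
    """Maximum number of matches when gaps may be inserted freely."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    print(ungapped_matches("ACCDDECA", "ACDDECA"))  # 3
    print(gapped_matches("ACCDDECA", "ACDDECA"))    # 7
```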

Alternatively, there are methods that define a score between a pattern 
and a substring and set a lower limit on the score to be allowed.

The advantages of the deterministic patterns are that they are very
simple, mathematically pure and easy to interpret. Their disadvantage is that
often there is no single pattern that matches all the family members.

Probabilistic Patterns 
In these patterns, for each position in the pattern, a score is assigned to
each of the symbols (e.g., four bases in DNA, or twenty amino acids in
proteins). Additionally, penalties (or probabilities) for insertions or deletions in
each pattern position are assigned. These can also be seen as a generalization of
substring patterns. These methods assign a score (probability) to a match to a
sequence.

This category contains: Profiles (Position Specific Scoring Matrices),
HMM (Hidden Markov Models) and Probabilistic Trees. All these methods are
related to the suggested algorithm as further detailed below.

In the probabilistic patterns the probabilities are often calculated from: 
i) the distribution of the symbols (e.g., amino acids or bases) in the
columns of the probability matrices.

ii) some external information from substitution matrices.

iii) Dirichlet mixtures.

The invention presented here derives the probability measures of 
symbols based on their type, position and context. However, the set of 
probabilities to be used for classification is determined by the algorithm in a
different manner than the above-mentioned methods. 

The advantages of probabilistic patterns are that they can be used when 
it is not possible to define one single regular expression type pattern, which 
matches all family sequences. They can assign different scores and different
gap penalties to each symbol in different pattern positions. Their disadvantages
are that they contain many more parameters to be estimated, and in order to
estimate their values, a large number of family members is needed. In addition,
noisy examples can enter the learning process, and therefore unmatched
sequences will be recognized as matched (see Higgins and Taylor, 2000, the
contents of which are hereby incorporated by reference).

Pattern and Sequence Driven Algorithms 
Brazma et al (1998) identified two main algorithmic approaches: Pattern
Driven (PD) methods and Sequence Driven (SD) methods.

Pattern driven methods: In the simplest form, PD methods enumerate all
patterns in the solution space, calculating each pattern's fitness so that the best
ones can be output. There are more sophisticated methods that contain
mechanisms for avoiding looking at all patterns in the solution space.

Sequence driven methods: In these methods different pairs of sequences are
compared in order to find similarities as patterns. These methods are very
similar to the sequence alignment methods. Examples for sequence driven 
methods are: 

i) Smith and Smith (1990) developed a method that first computes 
the similarity between all pairs of input sequences. The most 
similar pair is input to a local pair-wise alignment method that is 
based on dynamic programming. This algorithm outputs a pattern 
common to the two sequences. The two sequences aligned are 
replaced by their common pattern and the procedure is repeated 
until there remains only one pattern matching all the input 
sequences. 

ii) The Pratt program takes as input a set of sequences and finds
patterns matching at least a minimum number of the sequences
that is defined by the user. The user can also input constraints on
the patterns (the solution space). Pratt uses a two-step search
(initial pattern search and pattern refinement) for finding
conserved patterns (patterns that match at least the minimum
number of the sequences) from the chosen solution space having
maximum fitness.

The invention presented here is related to sequence driven methods, since
patterns are not enumerated. Instead, the training sequences are used to
construct the tree model, which represents probabilistically a set of patterns.

Fitness functions 

Fitness functions usually reflect how well a pattern fits the input 
sequence. Some fitness functions are designed according to the following
principles: 

i) According to Jonassen (2000), information content of a pattern is 
a measure of the information gained about an unknown sequence 
when one is told that the sequence matches the pattern. 

ii) Brazma et al (1996) described the minimum description length 
(MDL) principle, which assigns a score to a pattern that depends 
on the pattern's information content and on how many sequences 
it matches. The user can select parameters in order to slant the 
optimum towards strong patterns matching few sequences or
towards weaker patterns matching many sequences.

iii) PPV (Positive Predictive Value) - It is usually used by the Pratt
program when the aim is to find patterns to be used for
classification.

The invention presented here makes use of the tree model to derive a
fitness function for each sequence.

Training the Model 
In methods of supervised pattern-recognition (and particularly in those
related to biology pattern recognition), a portion of the set of labeled patterns is
extracted and used to derive a classification algorithm. These patterns comprise
the training set.

The remaining patterns are then used to test the classification algorithm;
these patterns are referred to as the test set. Since the correct classes of the
individual patterns in the test set are also known, the performance of the
algorithm can be evaluated. There is a tradeoff between the classification of the
training samples and the performance of the algorithm on new objects. Perfect
classification of the training samples can be achieved, but in this case new
objects that were not part of the training set usually are not well recognized.
This situation is known as overfitting, and it should be avoided. Thus, it is very
important to determine how to adjust the complexity of the model: it should not
be so simple that it cannot explain the differences between the categories, yet
not so complex as to give poor classification on novel patterns.

There are several ways to evaluate the algorithm. Among them are:

iv) Parametric Models: The generalization error rate is computed
from the assumed parametric model. A test set is not needed in
these models.

v) Simple Cross-Validation: The set of labeled samples D is split
into two parts, as mentioned above. One part is used as the
traditional training set. The other part is the validation set, which
is used to estimate the error rate. The classifier is trained until one
reaches a minimum of this error.

vi) General Cross-Validation: In more general cross-validation
methods, the performance is based on multiple, independently
formed validation sets. For example, in m-fold cross-validation,
the training set is randomly divided into m disjoint
sets of equal size. The classifier is trained m times, each time
with a different set held out as a validation set. The estimated
performance is the mean of these m errors.
In anti-cross-validation, the adjustment of parameters is halted
when the validation error reaches its first local maximum.
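
The m-fold procedure described above can be sketched as follows. The function names, the fixed random seed and the two caller-supplied callables `train` and `error_rate` are hypothetical and are introduced only for illustration; when the number of samples is not a multiple of m, the folds differ in size by at most one.

```python
import random


def m_fold_splits(samples, m, seed=0):
    """Yield (training, validation) pairs for m-fold cross-validation."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)        # random division of the set
    folds = [shuffled[i::m] for i in range(m)]   # m disjoint, near-equal sets
    for k in range(m):
        validation = folds[k]
        training = [s for i, fold in enumerate(folds) if i != k for s in fold]
        yield training, validation


def cross_validated_error(samples, m, train, error_rate, seed=0):
    """Mean validation error over the m folds.

    train(training) -> model and error_rate(model, validation) -> float
    are hypothetical callables supplied by the caller.
    """
    errors = [error_rate(train(tr), va)
              for tr, va in m_fold_splits(samples, m, seed)]
    return sum(errors) / len(errors)
```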

Conventional Performance Measure (taken from Jonassen, 2000) 
When a pattern is to be used for classification, it should ideally match all
family members and no other sequences. Most often, however, the pattern fails
to match some member sequences (called false negatives), and it may match
some sequences outside the family (called false positives). The fewer false
negatives, the more sensitive the pattern is said to be, and the fewer false
positives, the more specific it is. Ideally, a pattern should have zero false
positives and negatives.

An estimate of the number of matches in a sequence database can be
found by multiplying the probability that one random sequence matches the
pattern by the number of sequences in the database. In order to calculate the
probability, it is often assumed that random sequences are generated using a
specific probabilistic model. Sternberg (1991) considered all the patterns in the
PROSITE database and obtained a clear correlation between the expected
number of false positives and the actual number, i.e., the number of unrelated
sequences in the SWISS-PROT database matching the pattern.

Denoting the number of true positives (sequences in the family matching
the pattern) by TP and the number of false negatives by FN, the sensitivity of a
pattern can be defined as

Sensitivity = TP/(TP+FN).

Sensitivity measures how large a proportion of the family sequences is
'picked up by' (matched by) the pattern. Similarly, the specificity of the pattern
can be defined as

Specificity = TN/(TN+FP),

(where TN and FP are, respectively, the number of true negatives and the
number of false positives) which measures how large a proportion of the
sequences outside the family is not matched by the pattern. Yet another
useful measure is the positive predictive value (PPV), which determines how
large a proportion of the sequences matching the pattern is actually in the
family, i.e.,

PPV = TP/(TP+FP).
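
As a worked illustration of the three definitions, the following sketch computes them from the four counts. The function name and the example counts are hypothetical and do not come from any cited study; the sketch assumes that none of the three denominators is zero.

```python
def classification_measures(tp, fn, tn, fp):
    """Sensitivity, specificity and PPV from the four outcome counts.

    Assumes tp + fn, tn + fp and tp + fp are all non-zero.
    """
    return {
        "sensitivity": tp / (tp + fn),   # share of family members matched
        "specificity": tn / (tn + fp),   # share of non-members not matched
        "ppv": tp / (tp + fp),           # share of matches that are members
    }


# Hypothetical counts: 100 family members and 1000 unrelated sequences.
measures = classification_measures(tp=90, fn=10, tn=950, fp=50)
```

With these invented counts the pattern matches 90 of 100 family members and 50 of 1000 unrelated sequences, so roughly a third of all matches are false positives even though the specificity is 0.95.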

The value range for all the above measures is from zero to one, one
being the best possible. When evaluating patterns to be used for classification,
one needs to use more than one of the measures. This can be illustrated by two
degenerate cases: (1) the empty pattern matching any sequence, and (2) a pattern
matching one single member in the family. Pattern (1) has perfect sensitivity,
but very bad specificity and PPV, while pattern (2) has perfect specificity and
PPV, but bad sensitivity. In practice, one often needs to make a trade-off
between sensitivity and specificity when choosing which pattern to use for a
family. One way to evaluate a probabilistic pattern's ability to discriminate
between family members and other sequences is to find a cut-off on the score
that gives the same number of false positives and false negatives. Tatusov et al.
(1994) evaluated alternative ways of finding weight matrices from local
ungapped alignments using this approach. Another approach is to achieve
maximum TP given very high TN. This approach relies on the assumption that
the proportion of the "negatives" in the population is very high.



Used Models for Pattern Classification in Biology 

Some general probabilistic methods are often used to recognize a
number of families in biology. Some of the most common methods sharing
the same idea are Profiles (Position Specific Scoring), PWM (Positional
Weight Matrix) and WMM (Weight Matrix Model). These methods are often
referred to as non-homogeneous models, which means that the distribution of
the symbols differs from one position in the pattern to another.

Gribskov et al (1987) initially suggested profile matrices. A
profile is a scoring matrix giving position-specific scores and specific
gap penalties for each symbol (amino acids or bases). The matrix, known
as the PWM, is a table of statistics, $f_{b,i}$, of the frequencies of the symbol b in
position i of the known sequences (e.g., promoter or coding). This model
assumes that positions are independent. GENSCAN uses different signal
models to model different functional units. One of the models is the WMM, in
which every position has its own specific independent distribution. It is
used for modeling polyadenylation signals, translation initiation signals,
translation termination signals and promoters.
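
A position weight matrix of this kind can be sketched as below. This is an illustrative reading only: the function names, the DNA alphabet default and the pseudocount (added so that no frequency is zero) are assumptions, and gap penalties are omitted.

```python
import math


def build_pwm(aligned, alphabet="ACGT", pseudocount=1.0):
    """Per-position symbol frequencies from equal-length aligned sequences."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("all sequences must have the same length")
    pwm = []
    for i in range(length):
        # Start every count at the pseudocount (an assumption of this sketch).
        counts = {b: pseudocount for b in alphabet}
        for s in aligned:
            counts[s[i]] += 1
        total = sum(counts.values())
        pwm.append({b: counts[b] / total for b in alphabet})
    return pwm


def log_score(pwm, seq):
    """Log-probability of seq, treating the positions as independent."""
    return sum(math.log(column[b]) for column, b in zip(pwm, seq))
```

Because the positions are treated as independent, the score of a sequence is simply the sum of the per-position log frequencies.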

Another model is the weighted array model (WAM). The WAM
model is a generalization of the WMM model that allows dependencies
between adjacent positions. The WAM model is used for the recognition of
splice sites. Correct recognition of these sites greatly enhances the ability to
predict correct exon boundaries. This modeling of splice sites gave
GENSCAN a substantial improvement in performance. This model can be
seen as an extension of the HMM, since each position has its own HMM
network.

As was mentioned before, a hidden Markov model (HMM) is a
Markov chain in which the states are not directly observable. Instead, the
output of the current state is observable. The output symbol for each state is
randomly chosen from a finite output alphabet according to some
probability distribution.

Ohler et al (1999) used three interpolated Markov Chains of 
different order, which are trained on coding, non-coding and promoter 
sequences. 

A Generalized Hidden Markov Model (GHMM) generalizes the 
HMM as follows: in a GHMM, the output of a state may not be a single 
symbol. Instead, the output may be a string of finite length. For a particular 
current state, the length of the output string as well as the output string itself 
might be randomly chosen according to some probability distribution. The 
probability distribution need not be the same for all states. For example, one 
state might use a weight matrix model for generating the output string, 
while another might use an HMM. Formally, a GHMM is described by a set
of five parameters:

i) A finite set Q of states.

ii) An initial state probability distribution $\pi_q$.

iii) Transition probabilities $T_{i,j}$ for $i, j \in Q$.

iv) Length distributions f of the states ($f_q$ is the length
distribution for state q).

v) Probabilistic models for each of the states, according to
which output strings are generated upon visiting a state.

The probabilistic model for gene structure suggested by Burge and
Karlin (1997) is based on a GHMM.



Another model is the probabilistic tree model. In the probabilistic tree
model, similarly to Markov models (MM), the probability of each symbol
depends on its k predecessors. The difference between MM and probabilistic
trees is that in MM k is fixed, whereas in probabilistic trees k is variable. In
practice, k is selected to attain the shortest contexts for which the conditional
probability of a symbol given the context is practically equal to the conditional
probability of that symbol given the whole data. For example, Bejerano and
Yona (1998) used probabilistic suffix trees in order to model protein families.
The construction, parameterization and growth of the suffix tree are different
from those of the tree model presented here (for example, the construction of a
suffix tree requires multiple passes as opposed to a single pass in the context
tree; moreover, "partial leafs" that might have a vital importance for
classification are ignored in suffix trees). Vert (2001) used a similar tree
model for text clustering. The suggested invention is related to these methods
yet differs from them as indicated below.

The following is a survey of models that were
investigated for several classification applications.

Proteins

Gelfand (1995) reviewed methods for prediction of functional sites,
tRNA and protein coding genes. Fickett and Tung (1992) classified the
protein coding algorithms into five groups:

i) Codon usage - using the frequencies of the 64 codons. See, for
example, Staden and McLachlan (1982), Gribskov, Devereux and
Burgess (1984), Hinds and Blake (1985), Kolaskar and Reddy
(1985), Borodovsky et al (1986), Claverie and Bougueleret
(1986), Fichant and Gautier (1987), Lapedes et al (1990).

ii) Encoded amino acid sequence. See, for example, McCaldon and
Argos (1988), Tramontano and Macchiato (1986), Moody and
Fristensky (1987).

iii) Base compositional bias between codon positions - These
algorithms consider the differences between the three codon
positions. See, for example, Shepherd (1981), Fickett (1982),
Bibb, Findlay and Johnson (1984), Almagor (1985), Trifonov
(1987).

iv) Imperfect periodicity in base occurrences. See, for example,
Michel (1986), Silverman and Linsker (1986), Arques and
Michel (1987), Konopka (1990).

v) Other global patterns. See, for example, Erickson and Altman
(1979), Shulman, Steinberg and Westmoreland (1981), Blaisdell
(1983).

Promoters 

Ohler & Niemann (2001) reviewed the identification and
analysis of eukaryotic promoters.

Discovering motifs 

Ohler and Niemann (2001) divided motif discovery methods
into two main categories: alignment methods, and enumerative or
exhaustive methods.

Alignment methods aim to identify unknown signals by a significant 
local multiple alignment of all sequences. Alignment approaches deliver a 
model of the motifs (such as a weight matrix) built from the alignment. 
They require different statistics depending on how often a pattern may be 
present in the sequences. 

There are direct multiple alignment methods, such as the consensus
algorithm, which aligns sequences one by one and optimizes the information
content of the weight matrix constructed from the alignment.

There are also statistical approach methods. They consider the start
positions of the motifs in the sequences to be unknown and perform a local
optimization to determine which positions deliver the most conserved motif.
Examples of these methods are: Gibbs sampling (Lawrence et al (1993))
and expectation maximization in the MEME system (Bailey and Elkan
(1995)).

On the other hand, the enumerative or exhaustive methods aim to
examine all oligomers of a certain length and report those that occur far 
more often than expected from the overall promoter sequence composition. 
These methods give a list of over-represented oligomers, possibly already 
grouped to form consensus sequences. They have to use an elaborate 
background model to judge the importance of frequent patterns. They also 
need to have the size of the motif specified in advance. 

The set of input data 

The methods are often applied on a set of promoters that were first 
grouped together using gene expression measurements. A new way to look 
at the data is to cluster genes based on both expression levels and common
motifs. An alternative approach is to identify elements by analyzing 
promoters of the same gene from approximately ten different related 
species. 

Promoters Recognition Algorithms: 

Ohler & Niemann (2001) divided these algorithms into two main 
groups based on different search principles: 



i) Search by signal algorithms - making predictions based
on the detection of core promoter elements such as the
TATA box or the initiator and/or transcription factor
binding sites outside the core.

ii) Search by content algorithms - identifying regulatory 
regions based on the sequence composition of promoter 
and nonpromoter examples. 

There are also methods that combine both ideas - looking for signals
and for regions of specific composition. Other methods and ideas for 
finding the promoters are: 

i) Providing an accurate prediction of the TSS (transcription start 
site). This idea is good only for small regions known to contain a 
promoter (see Zhang, 1998). 

ii) Providing specific prediction of regulatory regions using a search 
by content approach. The method gives no information regarding 
whether the affected gene is on the leading or lagging strand, or 
where the TSS itself is located within the region (see Scherf, 
2000). 

iii) Constructing specific, rather than general, promoter models for
groups of genes, such as muscle-active genes known by experiment to
contain a specific combination of regulatory elements (see
Wasserman and Fickett, 1998).

Davuluri, Grosse and Zhang (2001) presented a set of discriminant
functions that can recognize promoters in the human genome. They explain
the implementation of these functions into a decision tree that constitutes a
new program called FirstEF. They obtained TP = 86% with FN = 17%.

Fickett and Hatzigeorgiou (1997) provide a review of
eukaryotic promoter recognition methods:

Hsu et al (1994) and Wright et al. (1991) used consensus sequences,
giving the most preferred base at each position within a site. This approach
loses much of the information and is of marginal utility.

PWM (positional weight matrix) assigns a weight to each possible
nucleotide at each position of a putative binding site and gives as a site
score the sum of these weights. PWMs are more informative, and are used
when enough information is available to build them.

Bucher (1990) developed an iterative algorithm for weight matrix
refinement in order to find motifs in the promoters. This kind of model
assumes a non-homogeneous structure, which means that the symbol
distribution differs between the positions in the pattern.

Interpolated HMM - In these techniques, the estimated probability of
a sequence is the linear or other interpolation between all conditional
probabilities with increasing context length. Ohler et al. (1999) used three
interpolated Markov chains of different order, which are mainly used to
recognize eukaryotic promoters. They compared promoters versus
non-promoters (coding sequences, intron sequences and both coding and
non-coding sequences). The best accuracy was achieved in promoters versus
CDS (TN = 95%, TP = 88.9%, AC = 91.95%).

Coding 

Codon frequencies in coding regions 

An informative method to determine coding regions takes advantage
of the frequencies at which the various codons occur in coding regions. For
example, the amino acids Leucine, Alanine and Tryptophan are coded by 6, 
4 and 1 different codons respectively. In a translation of a uniformly 
random DNA sequence, these amino acids should occur in the ratio 6:4:1, 
but in a protein they occur at a different ratio - 6.9:6.5:1. Therefore coding 
DNA is not random. Another example of the non-uniformity of coding 
DNA is the fact that A or T occurs in the position of a codon in a rate 
over 90% (these statistics vary for different species). 

Finding long ORFs 

Another way to distinguish coding regions from non-coding regions
is to examine the frequencies of stop codons. Assuming a uniform random
distribution, a stop codon is expected to be observed every 64/3=21.33
codons (since there are 3 stop codons). Average proteins are much longer,
being coded by about 1000 bp (base pairs). Each coding region has only one
stop codon, which terminates the region. Therefore, one way to detect the
coding regions is to look for long sequences of codons without any stop
codon. The algorithm that uses the above idea scans the DNA sequence,
looking for long ORFs in all three reading frames. Upon detecting a stop
codon, the algorithm scans backward, searching for a start codon. This
algorithm will fail to detect very short genes, as well as overlapping long
ORFs on opposite strands. Moreover, there are a lot more ORFs than genes.
For example, one can find 6500 ORFs in the DNA of the bacterium E. coli
while there are only 4400 genes.
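
The scan described above can be sketched as follows for a single strand. The names, the `min_codons` parameter and the choice of ATG as the only start codon are assumptions made for the example. Instead of scanning backward from each stop codon, the sketch remembers the most upstream start codon seen since the previous stop in the same frame, which is assumed here to be the start codon the backward scan is meant to find.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"   # assumption: only the canonical start codon is used

# Expected spacing of stop codons in uniformly random DNA (64 codons, 3 stops).
EXPECTED_CODONS_PER_STOP = 64 / 3


def find_orfs(dna, min_codons=1):
    """Return (start, end, frame) for ORFs on the given strand.

    start is the index of the first base of the start codon, end is the
    index just past the stop codon, and frame is 0, 1 or 2. min_codons is
    the minimum number of codons before the stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None                       # most upstream ATG since last stop
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if codon == START_CODON and start is None:
                start = pos
            elif codon in STOP_CODONS:
                if start is not None and (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None               # begin a new search after the stop
    return orfs
```

Raising `min_codons` well above the expected spacing of about 21 codons keeps only ORFs that are unlikely to be free of stop codons by chance.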



ORFs as Markov chains 

Assuming one finds all ORFs in a sequence, one can use codon
frequencies to find which ORFs are coding and which are non-coding open
reading frames (NORFs). This is done by translating each ORF into a codon 
sequence and obtaining a 64-state Markov chain. One can use a state for 
each codon rather than a state for each amino acid, because codons are more 
informative than their translations (there might be a preference for a specific 
codon in gene expression over other codons that encode the same amino 
acid). The transition probabilities are the probabilities for each codon to 
follow any other codon in a coding region. Using this model, one can 
compute the probability that a given ORF is really a coding region. 
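
A 64-state codon chain of this kind can be sketched as below. The function names and the pseudocount smoothing are assumptions of the example, and the input is assumed to consist of upper-case A, C, G and T only. The sketch returns a log-likelihood; one possible way to turn it into a coding versus non-coding decision, assumed here rather than taken from the text, is to compare it with the log-likelihood under a second chain trained on non-coding ORFs.

```python
import math
from collections import defaultdict
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]   # the 64 states


def to_codons(orf):
    """Split an ORF into consecutive codons, ignoring trailing bases."""
    return [orf[i:i + 3] for i in range(0, len(orf) - len(orf) % 3, 3)]


def train_codon_chain(coding_orfs, pseudocount=1.0):
    """Estimate P(next codon | current codon) from known coding ORFs."""
    counts = {a: defaultdict(float) for a in CODONS}
    for orf in coding_orfs:
        codons = to_codons(orf)
        for a, b in zip(codons, codons[1:]):
            counts[a][b] += 1
    chain = {}
    for a in CODONS:
        # The pseudocount (an assumption) keeps unseen transitions non-zero.
        total = sum(counts[a].values()) + pseudocount * len(CODONS)
        chain[a] = {b: (counts[a][b] + pseudocount) / total for b in CODONS}
    return chain


def log_likelihood(chain, orf):
    """Log-probability of the codon transitions in orf under the chain."""
    codons = to_codons(orf)
    return sum(math.log(chain[a][b]) for a, b in zip(codons, codons[1:]))
```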
Exons
Spliced Alignment 

Given a genomic sequence and a set of candidate exons, the spliced
alignment algorithm (see Gelfand, Mironov and Pevzner, 1996) explores all
possible exon assemblies and finds a chain of exons which best fits a
related target protein. The set of candidate exons is constructed by
considering all blocks between candidate acceptor and donor sites (i.e.,
between the AG dinucleotide at the intron-exon boundary and the GU
dinucleotide at the exon-intron boundary) and further filtration of this set.
To avoid losing true exons, the filtration procedure is designed to be very
gentle, and the resulting set of blocks may contain a large number of false
exons. Instead of trying to identify the correct exons by further pursuit of
statistical methods, the algorithm considers all possible chains of candidate
exons and finds a chain with the maximum global similarity to the target
protein.

Network formulation 

The spliced alignment problem can be formulated in network terms.
The set of blocks $p_i$ is represented by a set of nodes $v_i$. A node $v_i$ is
connected to a node $v_j$ if $p_i < p_j$. The requested solution is the best
alignment between the reference sequence (J) and a path in the network.

Further information is available from Ben-Gal I., Shmilovici A., Morag
G., "Design of Control and Monitoring Rules for State Dependent Processes",
Journal of Manufacturing Science and Production, 3, Nos. 2-4, 2000, pp. 85-93;
also Ben-Gal I., Morag G., Shmilovici A., "Statistical Control of
Production Processes via Context Monitoring of Buffer Levels", submitted
(after revision); Ben-Gal I., Singer G., "Integrating Engineering Process
Control and Statistical Process Control via Context Modeling", submitted
(after revision); Shmilovici A., Ben-Gal I., "Context Dependent ARMA
Modeling", Proc. of the 21st IEEE Convention, Tel-Aviv, Israel, April 11-12,
2000, pp. 249-252; Morag G., Ben-Gal I., "Design of Control Charts Based on
Context Universal Model", Proc. of the Industrial Engineering and
Management Conference, Beer-Sheva, May 3-4, 2000, pp. 200-204; Zinger
G., Ben-Gal I., "An Information Theoretic Approach to Statistical Process
Control of Autocorrelated Data", Proc. of the Industrial Engineering and
Management Conference, Beer-Sheva, May 3-4, 2000, pp. 194-199 (in
Hebrew); Ben-Gal I., Shmilovici A., Morag G., "Design of Control and
Monitoring Rules for State Dependent Processes", Proc. of the 2000
International CIRP Design Seminar, Haifa, Israel, May 16-18, 2000, pp. 405-410;
Ben-Gal I., Shmilovici A., Morag G., "Statistical Control of Production
Processes via Monitoring of Buffer Levels", Proc. of the 9th International
Conference on Productivity & Quality Research, Jerusalem, Israel, June 25-28,
2000, pp. 340-347; Shmilovici A., Ben-Gal I., "Statistical Process Control for
a Context Dependent Process Model", Proc. of the Annual EURO Operations
Research Conference, Budapest, Hungary, July 16-19, 2000; Ben-Gal I.,
Shmilovici A., Morag G., "An Information Theoretic Approach for Adaptive
Monitoring of Processes", ASI2000, Proc. of The Annual Conference of
ICIMS-NOE and IIMB, Bordeaux, France, September 18-20, 2000; Singer G. and
Ben-Gal I., "A Methodology for Integrating Engineering Process Control and
Statistical Process Control", Proc. of The International Conference on
Production Research, Prague, Czech Republic, 29 July - August 3, 2001; and
Ben-Gal I., Shmilovici A., "Promoters Recognition by Varying-Length Markov
Models", Artificial Intelligence and Heuristic Methods for Bioinformatics, 30
Sept. - 12 Oct., San-Miniato, Italy. The contents of each of the above
documents are hereby incorporated by reference.

Summary of the Invention 

According to a generalized aspect of the present invention there is thus
provided an algorithm, which can analyze strings of consecutive symbols taken
from a finite set. The symbols are viewed as observations taken from a
stochastic source with unknown characteristics. Without a priori knowledge,
the algorithm constructs probabilistic models that represent the classes,
dynamics and interrelations within the data. It then monitors incoming data
strings for compatibilities or incompatibilities with the models that were
constructed. Compatibilities between the probabilistic model and the incoming
strings are identified and analyzed to trigger appropriate actions such as correct
classification. Incompatibilities between the probabilistic model and the
incoming strings are identified and analyzed to trigger appropriate actions
(application dependent).

According to a first aspect of the present invention there is provided
apparatus for building a stochastic model of a data sequence, said data
sequence comprising spatially related symbols selected from a finite symbol
set, the apparatus comprising:

an input for receiving said data sequence,

a tree builder for expressing said symbols as a series of counters within
nodes, each node having a counter for each symbol, each node having a
position within said tree, said position expressing a symbol sequence and each
counter indicating a number of its corresponding symbol which follows a
symbol sequence of its respective node, and

a tree reducer for reducing said tree to an irreducible set of conditional 
probabilities of relationships between symbols in said input data sequence. 

Preferably, the tree reducer comprises a tree pruner for removing from 
said tree any node whose counter values are within a threshold distance of 
counter values of a preceding node in said tree. 
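
By way of a simplified, non-limiting illustration, the counters and the threshold pruning described above might be sketched as follows. Everything specific in the sketch is an assumption made for the example: the function names, the representation of a node as a tuple of preceding symbols, the use of the total variation distance between normalized counters as the "distance", and the top-down order in which nodes are examined.

```python
from collections import Counter


def build_counts(sequence, max_depth):
    """Map each context (tuple of preceding symbols) to next-symbol counters."""
    counts = {}
    for i, symbol in enumerate(sequence):
        for k in range(0, min(max_depth, i) + 1):
            context = tuple(sequence[i - k:i])    # the k symbols before i
            counts.setdefault(context, Counter())[symbol] += 1
    return counts


def distribution(counter, alphabet):
    """Normalize a node's counters into conditional probabilities."""
    total = sum(counter.values())
    return [counter[s] / total for s in alphabet]


def prune(counts, alphabet, threshold):
    """Keep only nodes whose counters differ enough from their parent's."""
    kept = {}
    for context in sorted(counts, key=len):       # parents before children
        if not context:
            kept[context] = counts[context]       # the root is always kept
            continue
        parent = context[1:]                      # drop the oldest symbol
        if parent not in kept:
            continue                              # parent pruned, so is child
        p = distribution(counts[context], alphabet)
        q = distribution(counts[parent], alphabet)
        distance = 0.5 * sum(abs(a - b) for a, b in zip(p, q))
        if distance > threshold:
            kept[context] = counts[context]
    return kept
```

In this sketch a node survives only when knowing one more preceding symbol changes the conditional probabilities by more than the threshold, so a constant sequence reduces to the root alone.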

Preferably, the various tree construction parameters are user definable.
Thus, such tree construction parameters include the threshold distance, and
the tree construction parameters are user selectable. Preferably, said user
selectable parameters further comprise a tree maximum depth.

Preferably, said tree construction parameters further comprise an
algorithm buffer size.

Preferably, said tree construction parameters further comprise values of
pruning constants.

Preferably, said tree construction parameters further comprise a length
of input sequences.

Preferably, said tree construction parameters further comprise an order
of input symbols.

Preferably, said tree reducer further comprises a path remover operable
to remove any path within said tree that is a subset of another path within said
tree.

Preferably, said sequential data is a string comprising consecutive
symbols selected from a finite set.

The apparatus preferably further comprises an input string permutation
unit for carrying out permutations and reorganizations of the input string using
external information about a process generating said string.

Preferably, said string is a nucleic acid sequence.

Preferably, said string is a promoter and said tree is operable to identify
other promoters.

Preferably, said string is a string of coding DNA, and said tree is
operable to identify other coding strings.

Preferably, said string is a string of non-coding DNA, and said tree is
operable to identify other non-coding strings.

Preferably, said string is a DNA string and said tree is operable to
identify poly-A terminators.

Preferably, said string has a given property, and said tree is operable to
identify other strings having said given property.

Preferably, said string is an amino-acid sequence and the symbols
comprise at least some of the 20 amino-acids.

Preferably, said string is an amino acid string and said tree is operable to
identify at least one of primary, secondary and three dimensional protein
structure.

Preferably, said string has a given property, and said tree is operable to
identify other strings having said given property.

Preferably, said nucleic acid sequence is a promoter sequence and
another nucleic acid sequence is a non-promoter sequence, wherein said
stochastic modeler is operable to build models of said promoter sequence and
said non-promoter sequence and said comparator is operable to compare a third
nucleic acid sequence with each of said models to determine whether said third
sequence is a promoter sequence or a non-promoter sequence.

Preferably, said nucleic acid sequence is a coding sequence and another
nucleic acid sequence is a non-coding sequence, wherein said stochastic
modeler is operable to build models of said coding sequence and said
non-coding sequence and said comparator is operable to compare a third nucleic
acid sequence with each of said models to determine whether said third
sequence is a coding sequence or a non-coding sequence.

Preferably, said nucleic acid sequence is a repetitive sequence and
another nucleic acid sequence is a non-repetitive sequence, wherein said
stochastic modeler is operable to build models of said repetitive sequence and
said non-repetitive sequence and said comparator is operable to compare a third
nucleic acid sequence with each of said models to determine whether said third
sequence is a repetitive sequence or a non-repetitive sequence.

Preferably, said nucleic acid sequence is a non-coding sequence and
another nucleic acid sequence is a coding sequence, wherein said stochastic
modeler is operable to build models of said non-coding sequence and said
coding sequence and said comparator is operable to compare a third nucleic
acid sequence with each of said models to determine whether said third
sequence is a non-coding sequence or a coding sequence.

Preferably, said nucleic acid sequence is an exon sequence, wherein said
stochastic modeler is operable to build a model of said exon sequence and said
comparator is operable to compare a second nucleic acid sequence with said
model to determine whether said second sequence is an exon sequence.

Preferably, said nucleic acid sequence is an intron sequence, wherein
said stochastic modeler is operable to build a model of said intron sequence and
said comparator is operable to compare a second nucleic acid sequence with
said model to determine whether said second sequence is an intron sequence.

Preferably, said stochastic model is refinable using further data
sequences, thereby to define a structure of a common attribute of said data
sequences.

According to a second aspect of the present invention there is provided
apparatus for determining statistical consistency in spatially related data
comprising a finite set of symbols, the apparatus comprising

a sequence input for receiving said spatially related data,

a stochastic modeler for producing at least one stochastic model from at
least part of said spatially related data,

and a comparator for comparing said stochastic model with a prestored
model, thereby to determine whether there has been a statistical change in said
data.

Preferably, said stochastic modeler comprises: 

a tree builder for expressing said symbols as a series of counters within
nodes, each node having a counter for each symbol, each node having a
position within said tree, said position expressing a symbol sequence and each
counter indicating a number of its corresponding symbol which follows a
symbol sequence of its respective node, and

a tree reducer for reducing said tree to an irreducible set of conditional
probabilities of relationships between symbols in said input data sequence.

Preferably, said prestored model is a model constructed using another
part of said spatially related data.

Preferably, said comparator comprises a statistical processor for
determining a statistical distance between said stochastic model and said
prestored model.

Preferably, said comparator comprises a statistical processor for
determining a difference in statistical likelihood between said stochastic model
and said prestored model.

Preferably, said statistical distance is a relative complexity measure.
The statistical distance may comprise an SPRT statistic, or an MDL statistic, or
a multinomial goodness of fit statistic, or a Weinberger statistic, or a KL
statistic, or any other suitable statistic.

Preferably, said tree reducer comprises a tree pruner for removing from
said tree any node whose counter values are within a threshold distance of
counter values of a preceding node in said tree.

Preferably, said threshold distance, and other tree construction 
parameters, are user selectable. 

Preferably, tree construction parameters further comprise a tree 
maximum depth. 

Preferably, tree construction parameters further comprise an algorithm
buffer size.

Preferably, tree construction parameters further comprise values of
pruning constants.

Preferably, user selectable parameters further comprise a length of input
sequences.

Preferably, tree construction parameters further comprise an order of 
input symbols. 

Preferably, said tree reducer further comprises a path remover operable 
to remove any path within said tree that is a subset of another path within said 
tree. 

Preferably, said data comprises a nucleic acid sequence. 

Preferably, said data comprises an amino-acid sequence. 

Preferably, said sequential data is an output of a medical sensor sensing
bodily functions.

Preferably, said nucleic acid sequence is a promoter sequence and 
another nucleic acid sequence is a non-promoter sequence, wherein said 
stochastic modeler is operable to build models of said promoter sequence and 
said non-promoter sequence and said comparator is operable to compare a third 
nucleic acid sequence with each of said models to determine whether said third 
sequence is a promoter sequence or a non-promoter sequence. 

Preferably, said nucleic acid sequence is a coding sequence and another
nucleic acid sequence is a non-coding sequence, wherein said stochastic
modeler is operable to build models of said coding sequence and said
non-coding sequence and said comparator is operable to compare a third nucleic
acid sequence with each of said models to determine whether said third
sequence is a coding sequence or a non-coding sequence.

Preferably, said nucleic acid sequence is a repetitive sequence and
another nucleic acid sequence is a non-repetitive sequence, wherein said
stochastic modeler is operable to build models of said repetitive sequence and
said non-repetitive sequence and said comparator is operable to compare a third
nucleic acid sequence with each of said models to determine whether said third
sequence is a repetitive sequence or a non-repetitive sequence.

Preferably, said nucleic acid sequence is a non-coding sequence and
another nucleic acid sequence is a coding sequence, wherein said
stochastic modeler is operable to build models of said non-coding sequence and
said coding sequence and said comparator is operable to compare a
third nucleic acid sequence with each of said models to determine whether said
third sequence is a non-coding sequence or a coding sequence.

Preferably, said data sequence comprises image data of a first image.

Preferably, said distance is indicative of a statistical distribution within
said image.

Preferably, the apparatus further comprises an image comparator for
comparing said statistical distribution with a statistical distribution of another
image.

Preferably, the other image is of a same view as said first image taken at
a different time, said distance being indicative of time dependent change.

Preferably, said image data comprises medical imaging data, said
statistical distance being indicative of deviations of said data from an expected
norm.

The embodiments are preferably applicable to a database to perform 
data mining on said database. 

Preferably, said stochastic model is constructed from descriptions of a
plurality of enzymes for carrying out a given task, said model thereby
providing a generic structural description of an enzyme for carrying out said
task.

Preferably, the model is usable to analyze results of a nucleic acid
microarray.

Preferably, the model is usable to analyze results of a protein
microarray.

According to a third aspect of the present invention there is provided a
method of designing a protein for carrying out a predetermined task, the
method comprising:

taking a plurality of proteins known to carry out said predetermined
task,

constructing a stochastic model using an amino acid sequence of said
plurality of proteins,

using said stochastic model to predict a protein sequence.

According to a fourth aspect of the present invention there is provided a
method of designing a protein for carrying out a predetermined
task, the method comprising:

taking a plurality of proteins known to carry out said predetermined
task,

constructing a stochastic model using the 3D structure of said plurality
of proteins,

using said stochastic model to determine a protein structure. 

According to a fifth aspect of the present invention there is provided a
method of distinguishing between biological sequences of a first kind and
biological sequences of a second kind, each kind being expressible in terms of
a same finite set of symbols, the method comprising:

obtaining a statistically significant set of sequences of said first kind and
building a stochastic model thereof,

obtaining a statistically significant set of sequences of said second kind
and building a stochastic model thereof, and

taking a further sequence and comparing it with each stochastic model to
determine whether it belongs to either set.

Preferably, said biological sequences are nucleic acid sequences.

Preferably, said biological sequences are amino acid sequences. 

Preferably, the sequences of said first kind are promoter sequences. 

Alternatively or additionally, the sequences of said first kind are coding
sequences and the sequences of said second kind are non-coding sequences.

Preferably, the sequences are non-species specific, thereby constructing 
models which are non-species specific. 



Brief Description of the Drawings 

For a better understanding of the invention and to show how the same
may be carried into effect, reference will now be made, purely by way of
example, to the accompanying drawings.

With specific reference now to the drawings in detail, it is stressed that
the particulars shown are by way of example and for purposes of illustrative
discussion of the preferred embodiments of the present invention only, and are
presented in the cause of providing what is believed to be the most useful and
readily understood description of the principles and conceptual aspects of the
invention. In this regard, no attempt is made to show structural details of the
invention in more detail than is necessary for a fundamental understanding of
the invention, the description taken with the drawings making apparent to those
skilled in the art how the several forms of the invention may be embodied in
practice. In the accompanying drawings:

Fig. A1 is a tree diagram describing general methodologies for pattern
classification.

Fig. A2 is a schematic diagram of a prokaryotic gene sequence. 

Fig. A3 is a schematic diagram of a eukaryotic gene sequence. 

Fig. 1 is a simplified diagram showing the interrelationships between 
different modeling and characterization methods related to Statistical Process 
Control and Change Point areas. These areas are related to pattern classification 
and are relevant to the presented invention. The figure specifically shows 
where the present embodiments fit in with the prior art. 

Fig. 2 is a block diagram of a device for monitoring an input sequence 
according to a first preferred embodiment of the present invention. 

Fig. 3a is a context tree constructed from a simulator in accordance with
an embodiment of the present invention.

Fig. 3b is a context tree for 238 E. coli promoters constructed in
accordance with an embodiment of the present invention.

Fig. 4 is a simplified flow diagram showing a process of building an 
optimal context tree according to embodiments of the present invention. 

Fig. 5 is a simplified flow diagram showing a process of monitoring
using embodiments of the present invention.

Fig. 6A is a flow diagram showing the procedure for building up nodes
of a context tree according to a preferred embodiment of the present invention.

Fig. 6B is a variation of the flow chart of Fig. 6A which carries out tree
growth to reach a predetermined depth faster.

Figs. 7-11 are simplified diagrams of context trees at various stages of
their construction.

Fig. 12 is a state diagram of a stochastic process that can be monitored
to demonstrate operation of the present embodiments, and

Figs. 13-23 show various stages and graphs in modeling and
attempting to control the process of Fig. 12 according to the prior art and
according to the present invention.



Description of the Preferred Embodiments 

In the present embodiments, a model-generic pattern classification
method and apparatus are introduced for the control of state-dependent data.
The method is based on the context-tree model that was proposed by Rissanen
(1983) for data-compression purposes and later for prediction and identification
(see Weinberger, Rissanen and Feder (1995)). The context-tree model
comprises an irreducible set of conditional probabilities of output symbols
given their contexts. It offers a way of constructing a simple and compact
model of a sequence of symbols, including those describing complex, non-linear
processes such as higher-order hidden Markov models (HMMs). An algorithm is
used to construct a context tree, and preferably, the context-tree algorithm
generates a minimal tree, depending on the input parameters of the algorithm,
that fits the data gathered.

The present embodiments are based on a modified context-tree that
belongs to the above category. The suggested model differs from the
models discussed in the background in respect of:

i) its construction principles;

ii) its ability to find and compute what we call "partial contexts" and
their probabilities, which were found to be of vital importance in
biological classification applications; and

iii) the suggested distance measures between different trees.

In order to monitor the statistical attributes of a process, the first
embodiments compare two context trees at any time during monitoring. For
example, in certain types of analysis, in particular DNA analysis, it is possible
to divide a sequence into subsequences and build trees for each. Thus, it is
possible to compare several pairs of trees at once, with a monitor tree and a
reference tree formed from monitor and reference data respectively for each
pair. The comparison may be carried out by comparing the likelihood density
of the given sequence according to each of the trees and then classifying the
sequence against a given threshold. The first context tree is a reference tree
that belongs to a certain class, that is to say a model of how the classified
data is expected to behave. There might be several context trees of this type,
each belonging to a known class of sequences, such as coding sequences,
promoters, etc. The second context tree is a monitored tree, generated
periodically from a sequence of an unknown class, which needs to be
classified. The tree parameters are often unknown and need to be estimated. A
preferred embodiment uses maximum likelihood estimates and likelihood ratios
(or log-likelihood ratios) to measure a relative 'distance' between these two
trees with respect to a user-predetermined threshold. There are a number of
statistics that can be used in addition to or as an alternative to likelihood ratios,
as will be explained in greater detail below.

As will be explained below, in a first stage in certain of the
embodiments, such as for DNA analysis, a string may be divided into a
plurality of substrings for each of which a tree is built. Then several pairs of
these trees are compared simultaneously, wherein one of the trees in each pair is
a reference tree and is generated from a reference or training data set, and the
monitored tree is generated from the monitored data set.

In DNA applications in general, in the first stage groups of strings are
selected that share common properties or functionality, such as promoters,
binding sites, exons and introns, coding versus non-coding regions, amino acids
that have the same secondary (or higher) structure, or proteins or enzymes that
have a certain functionality (e.g., effects on patient health). The groups are
taken from a training or learning set. From the training set a tree model is built
for each group of strings. In the second stage, the tree model is
generally used for recognition or prediction over a "test set" of
strings. Thus, it is possible to recognize whether a given string belongs to a
group of promoters, coding DNA, non-coding DNA, or a certain group of proteins
with certain important properties, or the model may try to predict certain
properties of a given string (e.g., the secondary structure of a given sequence
of amino acids). Usually this is done by computing the likelihood of the given
string based on the tree models. Thus, if

(*) Pr{string | Tree No 2} > Pr{string | Tree No 1} > Pr{string | Tree No 4},

we can say that the string is most likely to belong to the group that is
described by Tree 2, and so on. In fact such a query is essentially Bayesian
estimation. In the general case, if we have a priori knowledge regarding the
distribution of the groups in the set, denoted by P{Tree model}, then the
posterior probabilities are computed by Bayes' theorem:

Pr{Tree model | string} = Pr{string | Tree model} * P{Tree model} / P{string}.

If a priori knowledge of the distribution of groups in the data is not
available, we may assign a uniform probability to all P{Tree model},
which is equivalent to using the simpler form (*) of the likelihood function.
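
The classification rule above can be illustrated with a minimal Python sketch. It is not the implementation of the embodiments: the function name `classify`, the model labels and the numeric likelihoods are hypothetical, and the per-model log-likelihoods log Pr{string | Tree model} are assumed to have been computed elsewhere from the tree models.

```python
import math

def classify(log_likelihoods, priors=None):
    """Pick the most probable tree model for a string via Bayes' theorem.

    log_likelihoods: dict mapping a model label to log Pr{string | model}
                     (assumed to be computed from the tree models elsewhere).
    priors:          dict mapping a model label to P{model}; if omitted, a
                     uniform prior is used, which reduces to comparing the
                     likelihoods directly, as in form (*).
    """
    labels = list(log_likelihoods)
    if priors is None:
        priors = {m: 1.0 / len(labels) for m in labels}
    # log of the numerator: log Pr{string | model} + log P{model}
    log_joint = {m: log_likelihoods[m] + math.log(priors[m]) for m in labels}
    # log P{string}, computed with the log-sum-exp trick for stability
    top = max(log_joint.values())
    log_evidence = top + math.log(sum(math.exp(v - top) for v in log_joint.values()))
    posteriors = {m: math.exp(log_joint[m] - log_evidence) for m in labels}
    best = max(posteriors, key=posteriors.get)
    return best, posteriors

# Hypothetical likelihoods ordered as in (*): Tree 2 > Tree 1 > Tree 4.
example = {"tree1": math.log(0.02), "tree2": math.log(0.05), "tree4": math.log(0.01)}
best_uniform, post_uniform = classify(example)
best_skewed, _ = classify(example, {"tree1": 0.9, "tree2": 0.05, "tree4": 0.05})
```

With a uniform prior the ranking of the posteriors is the same as the ranking of the likelihoods, whereas a strongly non-uniform prior can change which model is selected.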

In certain of the embodiments, similar models may differ based on
certain changes of the model construction or the classification algorithm, such
as: i) position dependence/inhomogeneous models; ii) mixed backward/forward
algorithms; iii) permutation and reorganization of input strings, thus
adding outside "information"; iv) type of levels, e.g., amino acids,
nucleotides or proteins; v) divisions of substrings, etc. Some of these modified
models are described below.

A preferred embodiment of the present invention, hereinafter Context-Tree
classification (CTC), has several particular advantages. Firstly, the
embodiment learns the dynamics and the underlying distributions within a data
string being monitored as part of model building. Such learning may be done
without requiring a priori information, hence categorizing the embodiment as
model-generic. Secondly, the embodiment extends the current limited scope of
pattern classification applications to state-dependent classes with varying
length order and partial leaves, as will be explained in more detail below.
Thirdly, the embodiment provides convenient monitoring of discrete data.

A second embodiment uses the Kullback-Leibler (KL) statistic (see
Kullback (1978)) to measure a relative distance between the two compared
trees and derive an asymptotic distribution of the relative distance. Monitoring
the KL estimates with respect to the asymptotic distribution indicates whether
there has been any significant change of characteristics in the input data.
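
As a hedged illustration of the KL statistic only, the following Python sketch computes the Kullback-Leibler divergence between two discrete distributions over the same symbol set. It does not reproduce how the embodiment aggregates the measure over the contexts of two trees or how the asymptotic distribution is derived; the function name `kl_divergence` and the example distributions are assumptions made for illustration.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions given as dicts symbol -> probability.

    Terms with p(x) == 0 contribute nothing; if q(x) == 0 where p(x) > 0 the
    divergence is infinite.
    """
    total = 0.0
    for symbol, p_x in p.items():
        if p_x == 0.0:
            continue
        q_x = q.get(symbol, 0.0)
        if q_x == 0.0:
            return math.inf
        total += p_x * math.log(p_x / q_x)
    return total

# Hypothetical symbol distributions at one node of a monitored and a reference tree.
monitored = {"a": 0.5, "b": 0.5}
reference = {"a": 0.25, "b": 0.75}
distance = kl_divergence(monitored, reference)
```

The measure is zero when the two distributions coincide and grows as they diverge; it is not symmetric in its arguments.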

Other embodiments use the stochastic complexity, or other statistical
measures, to measure a relative distance between the two compared trees and
derive an asymptotic distribution of the relative distance. Monitoring the
analytic distribution of the stochastic complexity indicates whether there has
been a significant change in the characteristics of the input data that requires a
different classification. An advantage of the second embodiment over the first
one is that it sometimes requires less monitored data in order to produce
satisfactory results.

Other possible statistics that may be used to measure the distance
between tree models include the following:

Wald's sequential probability ratio testing (SPRT): Wald's test is
implemented both in conventional CUSUM and change-point methods
and has analytical bounds developed based on the type-I and type-II
errors. The advantages of this statistic are that one can detect the exact
change point and apply a sequential sample size comparison between the
reference and the monitored tree.

MDL (Minimum Description Length): The MDL is the shortest
description of a given model and data string by the minimum number of
bits needed to encode them. Such a measure may be used to test whether
the reference 'in-control' context-tree and the monitored context-tree are
from the same distribution (see Rissanen (1999)).

Multinomial goodness of fit tests: Several goodness of fit tests
may be used for multinomial distributions. In general, they can be
applied to tree monitoring since any context tree can be represented by a
joint multinomial distribution of symbols and contexts. One of the most
popular tests is the Kolmogorov-Smirnov (KS) goodness of fit test.
Another important test that can be used for CC is the Anderson-Darling
(AD) test (Law and Kelton (1991)). This test is superior to the KS test
for distributions that mainly differ in their tail (i.e., it provides a
different weight for the tail).

Weinberger's Statistic: Weinberger et al. (1995) propose a
measure to determine whether the context-tree constructed by the context
algorithm is close enough to the "true" tree model (see eqs. (18), (19) in
their paper). The advantage of such a measure is its similarity to the
convergence formula (e.g., one can find bounds for this measure based
on the convergence rate and a chosen string length N). However, the
measure has been suggested and is more than adequate for coding
purposes since it assumes that the entire string N is not available.
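
Of the statistics listed above, Wald's SPRT lends itself to a short sketch. The Python fragment below uses Wald's standard approximate decision thresholds, derived from the type-I error rate alpha and the type-II error rate beta, and accumulates per-observation log-likelihood-ratio increments until one threshold is crossed. How those increments would be obtained from a reference tree and a monitored tree is not shown; the function name `sprt` and the example increments are assumptions.

```python
import math

def sprt(log_lr_increments, alpha=0.05, beta=0.05):
    """Sequential probability ratio test on a stream of log-likelihood-ratio increments.

    Each increment is log(P1(x_t) / P0(x_t)) for one observation, assumed to be
    supplied by the caller. Returns (decision, number_of_observations_used).
    """
    upper = math.log((1.0 - beta) / alpha)   # crossing upward: accept H1
    lower = math.log(beta / (1.0 - alpha))   # crossing downward: accept H0
    total = 0.0
    for n, increment in enumerate(log_lr_increments, start=1):
        total += increment
        if total >= upper:
            return "H1", n
        if total <= lower:
            return "H0", n
    return "undecided", len(log_lr_increments)

# Hypothetical increments: steady evidence for H1, for H0, and balanced evidence.
toward_h1 = sprt([1.0] * 5)
toward_h0 = sprt([-1.0] * 5)
balanced = sprt([0.1, -0.1])
```

The number of observations consumed before a decision is reached is itself an output, which corresponds to the sequential sample size property mentioned for this statistic.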

Before explaining at least one embodiment of the invention in detail, it
is to be understood that the invention is not limited in its application to the
details of construction and the arrangement of the components set forth in the
following description or illustrated in the drawings. The invention is capable
of other embodiments or of being practiced or carried out in various ways.
Also, it is to be understood that the phraseology and terminology employed
herein is for the purpose of description and should not be regarded as limiting.

Reference is now made to Fig. 1, which is a chart showing
characterization of SPC methods. We note that the embodiment of the
context-tree classification (CTC) is related to the embodiment of context-based
statistical process control (CSPC). In fact, both embodiments are based on the
suggested context-tree model; however, each has a different area of
application: CTC for pattern classification and CSPC for statistical process
control. We hereafter use these terms interchangeably. Fig. 1 shows how the
context-based SPC (CSPC) and, thus, the CTC embodiments of the present
invention relate to existing SPC methods. As discussed above in the
background, data sequences can be categorized into independent data and
interrelated data, and each of these categories can make use of model specific
and model generic methods. The embodiments of the present invention
denoted CSPC/CTC are characterized as providing a model generic method for
interrelated data.

Reference is now made to Fig. 2, which is a simplified block diagram
showing a generalized embodiment of the present invention. In the embodiment
of Fig. 2, an input data sequence 10 arrives at an input buffer 12. A stochastic
modeler 14 is able to use the data arriving in the buffer to build a statistical
model or measure that characterizes the data. The building process and the
form of the model will be explained in detail below.

The modeler 14 does not necessarily build models for all of the data.
During the course of processing, it may build a single model or it may build
successive models on successive parts of the data sequence. The model or
models are stored in a memory 16. A comparator 18 comprises a statistical
distance processor 20, which is able in one embodiment to make a statistical
distance measurement between a model generated from current data and a
prestored model. In a second embodiment the statistical distance processor 20
is able to make a statistical distance measurement of the distance between two
or more models generated from different parts of the same data. In a third
embodiment, the statistical distance processor 20 is able to make a statistical
distance measurement of the distance between a pre-stored model and a data
sequence. In each of these embodiments, the statistical measure is used by the
comparator 18 to determine whether (or not) a statistically significant change in
the data characteristics has occurred.

WO 02/067075 PCT/IL02/00131

As will be described below, the comparator 18 may use the
log-likelihood ratio or the KL statistical distance measure. KL is particularly
suitable where the series is stationary and sufficiently long. Other measures are
more appropriate where the series is space or otherwise dependent.

Reference is now made to Fig. 3, which is a simplified diagram showing
a prior art model that can be used to represent statistical characteristics of data.
In this section, we introduce the context-tree model for state-dependent data
and the concepts of its construction algorithm following the definitions and
notations in Rissanen (1983), Weinberger, Rissanen and Feder (1995) and
Ben-Gal et al. (2000, 2001). A detailed walk-through example presenting the
context-tree construction is given in Figures 7-11 and Tables A1 and A2.

Consider a sequence (string) of observations $x^N = x_1, \ldots, x_N$, with
elements $x_t$, $t = 1, \ldots, N$, defined over a finite symbol set, $X$, of size $d$.
In practice, this string can represent a realization sequence of a discrete
variable drawn from a finite set. In particular, the discrete variable can be a
queue length in a queuing system, such as the number of parts in a buffer in a
production line. For a finite buffer capacity $c$, the 'finite symbol set' (of
possible buffer levels) is $X = \{0, 1, 2, \ldots, c\}$ and $d$, the symbol-set size,
is thus equal to $d = c + 1$. For instance, the string $x^6 = 1, 0, 1, 2, 3, 3$
represents a sequence of six consecutive observations of the buffer level
(number of parts) in a production line with a finite buffer capacity $c$.

A family of probability measures $P_N(x^N)$, $N = 0, 1, \ldots$ is defined over the
set $\{x^N\}$ of all stationary sequences of length $N$, such that the marginality
condition

$$\sum_{x \in X} P_{N+1}(x^N x) = P_N(x^N) \qquad (2.1)$$

holds for all $N$; $x^N x = x_1, \ldots, x_N, x$; and $P_0(x^0) = 1$ where $x^0$ is the empty
string. For simplification of notations, the sub-index will be omitted, so that
$P_N(x^N) \equiv P(x^N)$.

One could opt to find a model that assigns the probability measure (2.1).
A possible finite-memory source model of the sequences defined above is the
Finite State Machine (FSM), which assigns a probability to an observation in
the string based on a finite set of states. Hence, the FSM is characterized by the
transition function, which defines the state for the next symbol,

$$s(x^{N+1}) = f(s(x^N), x_{N+1}) \qquad (2.2)$$

where $s(x^N) \in \Gamma$ are the states with a finite state space $|\Gamma| = S$; $s(x^0) = s_0$
is the initial state; and $f: \Gamma \times X \to \Gamma$ is the state transition map of the machine.
The FSM is then defined by $S \cdot (d-1)$ conditional probabilities, the initial state
$s_0$, and the transition function. The set of states of an FSM should satisfy the
requirement that the conditional probability to obtain a symbol given the whole
sequence is equal to the conditional probability to obtain the symbol given the
past state, implying that

$$P(x \mid x^N) = P(x \mid s(x^N)). \qquad (2.3)$$

A special case of FSM is the Markov process. The Markov process
satisfies (2.2) and is distinguished by the property that for a $k$th-order Markov
process $s(x^N) = x_N, \ldots, x_{N-k+1}$. Thus, reversed strings of a fixed length $k$ act as
source states. This means that the conditional probabilities of a symbol given
all past observations (2.3) depend only on a fixed number of observations $k$,
which defines the order of the process. However, even when $k$ is small, the
requirement for a fixed order can result in an inefficient estimation of the
probability parameters, since some of the states often depend on shorter strings
than the process order. On the other hand, increasing the Markov order to find a
best fit results in an exponential growth of the number of states, $S = d^k$, and,
consequently, of the number of conditional probabilities to be estimated.
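
The exponential growth described above can be made concrete with a small Python sketch. It simply evaluates $S = d^k$ and the $S \cdot (d-1)$ free conditional probabilities of a fixed-order model; the function names and the choice of a four-letter alphabet (as for nucleotides) are illustrative assumptions.

```python
def markov_state_count(d, k):
    """Number of states S = d**k of a k-th order Markov model over d symbols."""
    return d ** k

def markov_parameter_count(d, k):
    """Number of free conditional probabilities, S * (d - 1)."""
    return markov_state_count(d, k) * (d - 1)

# Illustrative assumption: a four-symbol alphabet, orders 1 to 6.
growth = [(k, markov_state_count(4, k), markov_parameter_count(4, k))
          for k in range(1, 7)]
```

For a four-symbol alphabet, a fifth-order model already has 1024 states and 3072 free parameters, which shows why a fixed high order quickly becomes costly to estimate.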

An alternative model to the Markovian is the context-tree that was
suggested by Rissanen (1983) for data compression purposes and modified
later in Weinberger, Rissanen and Feder (1995). The tree presentation of a
finite-memory source is advantageous since states are defined as contexts,
graphically represented by branches of variable length in the context-tree,
and hence requires less estimation effort than that required for a Markov
presentation. The context-tree is an irreducible set of conditional probabilities
of output symbols given their contexts. The tree is conveniently estimated by
the context algorithm. The algorithm generates an asymptotically minimal tree
fitting the data (Weinberger, Rissanen and Feder (1995)). The attributes of the
context-tree along with the ease of its estimation make it suitable for a
model-generic classifier, as seen later.

A context, $s(x^t)$, in which the "next" symbol in the string, $x_{t+1}$, occurs is
defined as the reversed string (we use the same notation for contexts as for the
FSM states, since here they follow similar properties),

$$s(x^t) = x_t, x_{t-1}, \ldots, x_{t-k+1}$$

for some $k \ge 0$, not necessarily the same for all strings (the case $k = 0$ is
interpreted as the empty string $s_0$). The string is truncated since the symbols
observed prior to $x_{t-k+1}$ do not affect the occurrence probability of $x_{t+1}$. For
the set of optimal contexts, $\Gamma = \{s : \text{shortest contexts satisfying (2.3)}\}$, $k$ is selected
to attain the shortest contexts for which the conditional probability of a symbol
given the context is practically equal to the conditional probability of that
symbol given the whole data, i.e., nearly satisfying (2.3). Thus, an optimal
context, $s \in \Gamma$, acts as a state of the context-tree, and is similar to a state in a
regular Markov model of order $k$. However, unlike the Markov model, the
lengths of various contexts do not have to be equal and one does not need to fix
$k$ such that it accounts for the maximum context length. The variable context
lengths in the context-tree model result in fewer parameters that have to be
estimated and, consequently, require less data to identify the source. It is noted,
however, that the optimal contexts model does not necessarily satisfy equation
(2.2), since the new state $s(x^{N+1})$ can be longer than $s(x^N)$ by more than one
symbol (see Weinberger, Rissanen and Feder (1995)).

Using the above definitions, a description of the context-tree follows. A
context-tree is an irreducible set of probabilities that fits the symbol sequence
$x^N$ generated by a finite-memory source. The tree assigns a distinguished
optimal context for each element in the string, and defines the probability of the
element, $x_t$, given its optimal context. These probabilities are used later for
classification and identification, that is, for comparing sequences of
observations and identifying whether they belong to the same class.
Graphically, the context-tree is a d-ary tree which is not necessarily complete
and balanced. Its branches (arcs) are labeled by the different symbol types.
Each node contains a vector of d conditional probabilities of all symbols $x \in X$
given the respective context (not necessarily optimal), which is represented by
the path from the root to that specific node. An optimal context $s \in \Gamma$ of an
observation is represented by the path starting at the root, with branch $x_t$
followed by branch $x_{t-1}$ and so on, until it reaches a leaf or a partial leaf.

Figure 3 exemplifies a context-tree that was constructed from a sequence
of observed buffer levels in a production line. Since in this case the buffer has a
finite capacity of $c = 2$, there are $d = 3$ symbol types, where observation
$x_t \in \{0,1,2\}$ refers to the number of parts in the buffer at time t. Using the
context algorithm, $S = 5$ optimal contexts are found (marked by bolded frame);
thus, the set of optimal contexts is a collection of reversed strings
$\Gamma = \{0, 2, 102, 1010, 10101\}$ (read from left to right). The context 1010 is a
partial leaf.

Consider the string $x^6 = 1,2,0,1,0,1$, which is generated from the tree
source in Figure 3. Employing the above definitions, the optimal context of the
next element, $x_7 = 0$, is $s(x^6) = 1,0,1,0$, i.e., following the reversed string from
the root until reaching an optimal context. Accordingly, the probability of $x_7$ given
the context is $P(x_7 = 0 \mid s(x^6)) = 0.33$. Note that had we used a Markov chain
model with maximal dependency order, which is $k = 5$ (the longest branch in
the tree), we would need to estimate the parameters of $3^5 = 243$ states (instead
of the five optimal contexts in the context-tree of Figure 3), although most of
them are redundant.
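
The lookup in the example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `optimal_context` is hypothetical, contexts are stored as reversed strings exactly as listed for Figure 3, and the optimal context is taken to be the longest member of the set that prefixes the reversed history, which stands in for walking the tree from the root.

```python
def optimal_context(history, optimal_contexts):
    """Return the longest optimal context that prefixes the reversed history.

    `history` is the observed string x_1..x_t (oldest first); contexts are
    written in reverse order (most recent symbol first), as in the text.
    Returns None if no optimal context matches (assumed convention).
    """
    reversed_history = "".join(str(x) for x in reversed(history))
    matches = [s for s in optimal_contexts if reversed_history.startswith(s)]
    return max(matches, key=len) if matches else None


# Set of optimal contexts quoted for the tree of Figure 3.
GAMMA = {"0", "2", "102", "1010", "10101"}

# x^6 = 1,2,0,1,0,1 -> reversed history 1,0,1,0,2,1 -> partial leaf 1010.
context = optimal_context([1, 2, 0, 1, 0, 1], GAMMA)

# A fixed-order Markov model with d = 3 and k = 5 needs d**k states.
markov_states = 3 ** 5
```

The comparison between `markov_states` and `len(GAMMA)` is the parameter saving that the paragraph above describes.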

The conditional probabilities of symbols given the optimal contexts,
$P(x|s)$, $x \in X$, $s \in \Gamma$, and the marginal probabilities of optimal contexts,
$P(s)$, $s \in \Gamma$, are estimated by the context algorithm. The joint probabilities of
symbols and optimal contexts, $P(x,s)$, $x \in X$, $s \in \Gamma$, represent the context-tree
model and are used to derive the classifying algorithm. This model might be
only an approximate description of the real generating source, but it is often
appropriate for practical purposes.

Reference is now made to Fig. 4, which is a simplified schematic
diagram showing stages of an algorithm for producing a context tree according
to a first embodiment of the present invention. The construction algorithm of
Fig. 4 is an extension of the Context algorithm given in Weinberger, Rissanen
and Feder (1995). The algorithm preferably constructs a context-tree from a
string of symbols and estimates the marginal probabilities of contexts and the
conditional probabilities of symbols given contexts. The algorithm comprises
five stages as follows: two concomitant stages of tree growing 42 and
iterative counter updating and tree pruning 46; a stage of optimal contexts
identification 48; and a stage of estimating context-tree probability parameters
50.

In the tree growing stage 42, a counter context-tree, $T_t$, $0 \le t \le N$, is grown up
to a maximum depth m. Each node in $T_t$ contains d counters, one for each
symbol type. The counters, $n(x|s)$, denote the conditional frequencies of the
symbols $x \in X$ in the string $x^t$ given the context s. Concomitantly with the tree
growth stage 42, the counter updating and tree pruning stage 46 ensures that the
counter values $n(x|s)$ are updated according to symbol occurrences, as will be
explained in more detail hereinbelow. The counter context tree is iteratively
pruned along with counter updating to acquire the shortest reversed strings,
thereby in practical terms satisfying equation 2.3, it being noted that exact
equality is not achieved. In the following stage, selection of optimal contexts
48, a set of optimal contexts $\Gamma$ is obtained, based on the pruned counter context
tree. In the estimation stage 50, the estimated conditional probabilities of
symbols given optimal contexts, $\hat{P}(x|s)$, $x \in X$, $s \in \Gamma$, and the estimated marginal
probabilities of optimal contexts, $\hat{P}(s)$, $s \in \Gamma$, are derived. As discussed in more
detail hereinbelow, both $\hat{P}(x|s)$ and $\hat{P}(s)$ are approximately multinomially
distributed and are used to obtain the CSPC control limits. The estimated joint
probabilities of symbols and optimal contexts, $\hat{P}(x,s) = \hat{P}(x|s)\,\hat{P}(s)$,
$x \in X$, $s \in \Gamma$, are then derived and represent the context-tree in its final form. It
is noted that the term "counter context-tree" is used to refer to the model as it
results from the first three stages in the algorithm and the term "context-tree" is
used to refer to the result of the final stage, which tree contains the final set of
optimal contexts and estimated probabilities.

Returning now to Fig. 2, once a model is obtained for incoming
data, the model is compared by comparator 18 with a reference model, or more
than one reference model, which may be a model of earlier received data such
as training data or may be an a priori estimate of statistics for the data type in
question or the like.

In the following, examples are given based on measurements
using the KL statistic. However, the skilled person will appreciate that other
statistical measures may be used, including but not restricted to those
mentioned hereinabove.

Kullback (1978), the contents of which are hereby incorporated by
reference, proposed a measure for the relative 'distance' or the discrimination
between two probability mass functions $Q(x)$ and $Q_0(x)$:

$K(Q(x), Q_0(x)) = \sum_{x \in X} Q(x) \log \dfrac{Q(x)}{Q_0(x)} \ge 0$

The measure, now known as the Kullback-Leibler (KL) measure, is
positive for all non-identical pairs of distributions and equals zero iff (if and
only if) $Q(x) = Q_0(x)$ for every x. The KL measure is a convex function in the
pair $(Q(x), Q_0(x))$, and invariant under all one-to-one transformations of the
data. Kullback has shown that the KL distance (multiplied by a constant)
between a d-category multinomial distribution $Q(x)$ and its estimated
distribution $\hat{Q}(x)$ is asymptotically chi-square distributed with $d-1$ degrees of
freedom:

$2N \cdot K(\hat{Q}(x), Q(x)) = 2 \sum_{x \in X} n(x) \log \dfrac{n(x)}{N\,Q(x)} \rightarrow \chi^2_{d-1}$

where N is the size of a sample taken from the population specified by
$Q(x)$; $n(x)$ is the frequency of category (symbol type) x in the sample,
$\sum_{x \in X} n(x) = N$; and $\hat{Q}(x) = n(x)/N$ is the estimated probability of category
(symbol type) x.
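
As a hedged illustration of the KL measure and of the scaled statistic that Kullback's result refers to, the following Python sketch computes both for a small multinomial sample. The function names and the example distribution are assumptions made for illustration only; natural logarithms are used, which is what the chi-square approximation requires.

```python
import math


def kl_divergence(q, q0):
    """KL measure K(q, q0) = sum_x q(x) * ln(q(x) / q0(x)).

    Terms with q(x) == 0 contribute zero; q0(x) must be positive
    wherever q(x) is positive.
    """
    return sum(q[x] * math.log(q[x] / q0[x]) for x in q if q[x] > 0)


def kl_statistic(counts, q0):
    """Return 2 * N * K(q_hat, q0), with q_hat(x) = n(x) / N."""
    n_total = sum(counts.values())
    q_hat = {x: n / n_total for x, n in counts.items()}
    return 2 * n_total * kl_divergence(q_hat, q0)


# Hypothetical reference distribution over d = 3 symbol types.
reference = {"a": 0.5, "b": 0.3, "c": 0.2}

matching_sample = {"a": 50, "b": 30, "c": 20}   # frequencies equal to reference
shifted_sample = {"a": 70, "b": 20, "c": 10}    # frequencies that drifted

stat_matching = kl_statistic(matching_sample, reference)
stat_shifted = kl_statistic(shifted_sample, reference)
```

Under the asymptotic result quoted above, `stat_shifted` would be compared against a chi-square distribution with d - 1 = 2 degrees of freedom.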

The KL measure for the relative 'distance' between two joint probability
mass functions $Q(x,y)$ and $Q_0(x,y)$ can be partitioned into two terms, one
representing the distance between the conditioning random variable and the
other representing the distance between the conditioned random variable.

In the present embodiments the comparator 18 preferably utilizes the KL
measure to determine a relative distance between two context-trees. The first
tree, denoted by $\hat{P}_i(x,s)$, represents the monitored distribution of symbols and
contexts, as estimated from a string of length N at the monitoring time $i = 1,2,\ldots$
The second tree, denoted by $P_0(x,s)$, represents the 'in-control' reference
distribution of symbols and contexts. The reference distribution is either known
a priori or can be effectively estimated by the context algorithm from a long
string of observed symbols, as will be discussed in greater detail below. In the
latter case, the number of degrees of freedom is doubled.

Utilizing what is known as the minimum discrimination information (MDI)
principle (see Alwan, Ebrahimi and Soofi (1998), the contents of which are
herein incorporated by reference), the context algorithm preferably generates a
tree of the data being monitored, the tree having a similar structure to that of
the reference tree. Maintaining the same structure for the current data tree and
the reference tree permits direct utilization of the KL measure.

Now, new observations are constantly being collected and may be used
for updating the current data tree, in particular the counters thereof, thus
updating the statistics represented by the tree. A significant change in the
structure of the tree may be manifested in the tree counters and the resulting
probabilities.

Using equation (3.3) above, it is possible to decompose the KL
measured distance between the current data context-tree and the reference
context-tree (both represented by the joint distributions of symbols and
contexts) into a summation involving two terms as follows:

$K(\hat{P}_i(x,s), P_0(x,s)) = K(\hat{P}_i(s), P_0(s)) + \sum_{s \in \Gamma} \hat{P}_i(s)\, K(\hat{P}_i(x|s), P_0(x|s))$   (3.4)

Of the two terms being summed, one measures the KL distance
between the trees' context probabilities, and the other measures the KL
distance between the trees' conditional probabilities of symbols given contexts.
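
The two-term decomposition can be checked numerically. The sketch below is an illustration only: the two small "trees" are hypothetical distributions over two contexts and two symbols, and the function names are not taken from the specification. It computes the context term and the symbols-given-contexts term separately and compares their sum with the KL measure of the joint distributions.

```python
import math


def kl(p, q):
    """KL measure between two distributions given as dicts on the same keys."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p if p[k] > 0)


def joint(p_s, p_x_given_s):
    """Joint distribution P(x, s) = P(x | s) * P(s)."""
    return {(x, s): p_s[s] * p
            for s, cond in p_x_given_s.items()
            for x, p in cond.items()}


def kl_terms(p_s, p_x_given_s, q_s, q_x_given_s):
    """Return (context term, symbols-given-contexts term) of the joint KL."""
    context_term = kl(p_s, q_s)
    symbol_term = sum(p_s[s] * kl(p_x_given_s[s], q_x_given_s[s]) for s in p_s)
    return context_term, symbol_term


# Hypothetical monitored tree (two contexts, two symbols).
mon_s = {"s1": 0.6, "s2": 0.4}
mon_x = {"s1": {"a": 0.7, "b": 0.3}, "s2": {"a": 0.2, "b": 0.8}}

# Hypothetical reference tree with the same structure.
ref_s = {"s1": 0.5, "s2": 0.5}
ref_x = {"s1": {"a": 0.5, "b": 0.5}, "s2": {"a": 0.4, "b": 0.6}}

context_term, symbol_term = kl_terms(mon_s, mon_x, ref_s, ref_x)
joint_kl = kl(joint(mon_s, mon_x), joint(ref_s, ref_x))
```

Keeping the two terms separate is what allows a shift in context frequencies to be told apart from a shift in the symbol probabilities within a context.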

Under the null hypothesis that the monitored tree $\hat{P}_i(x,s)$ is generated
from the same source that generated $P_0(x,s)$, and by using the multinomial
approximation referred to above, it is possible to derive an asymptotic
probability density function of the KL measure between $\hat{P}_i(x,s)$ and $P_0(x,s)$,
i.e.,

(3.5)

where $n(s)$ is the frequency of an optimal context $s \in \Gamma$ in the string; N is
the size of the monitored string; S is the number of optimal contexts; and d is
the size of the symbol set. As mentioned above, if the reference tree has to be
estimated, the number of degrees of freedom may be doubled. Thus, the KL
statistic for the joint distribution of the pair $(X, \Gamma)$ is asymptotically chi-square
distributed with degrees of freedom depending on the number of symbol types
and the number of optimal contexts. The result is of significance for the
development of control charts for state-dependent discrete data streams based
on the context-tree model.

Now, given a type I error probability $\alpha$, the control limits for the KL
statistic are given by

$0 \le 2N \cdot K(\hat{P}_i(x,s), P_0(x,s)) \le \chi^2_{Sd-1,\,1-\alpha}$   (3.6)

Thus, the upper control limit (UCL) is the $100(1-\alpha)$ percentile of the
chi-square distribution with $(Sd-1)$ degrees of freedom.
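
A minimal sketch of the control-limit test follows, under two loudly stated assumptions. First, the monitored quantity is taken to be 2N times the KL value, in line with Kullback's result quoted earlier. Second, the chi-square percentile is obtained with the Wilson-Hilferty approximation, a swapped-in technique chosen here only to keep the example within the Python standard library; an exact quantile routine would normally be preferred. All function names are hypothetical.

```python
from statistics import NormalDist


def chi2_quantile_approx(p, df):
    """Wilson-Hilferty approximation to the p-quantile of chi-square(df)."""
    z = NormalDist().inv_cdf(p)
    a = 2.0 / (9.0 * df)
    return df * (1.0 - a + z * a ** 0.5) ** 3


def upper_control_limit(num_contexts, num_symbols, alpha):
    """UCL: 100(1 - alpha) percentile of chi-square with S*d - 1 d.o.f."""
    return chi2_quantile_approx(1.0 - alpha, num_contexts * num_symbols - 1)


def out_of_control(kl_value, sample_size, ucl):
    """One-sided test: signal when 2 * N * KL exceeds the UCL (assumed scaling)."""
    return 2 * sample_size * kl_value > ucl


# Example: S = 5 optimal contexts, d = 3 symbol types, alpha = 0.05.
ucl = upper_control_limit(num_contexts=5, num_symbols=3, alpha=0.05)
alarm_small = out_of_control(kl_value=0.01, sample_size=100, ucl=ucl)
alarm_large = out_of_control(kl_value=0.50, sample_size=100, ucl=ucl)
```

With 14 degrees of freedom the approximation lands close to the tabulated 95th percentile, which is adequate for illustrating the decision rule.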

The control limit (3.6) has the following advantageous characteristics:

i) It is a one-sided bound; if the KL value is larger than the UCL,
the process is assumed to be 'out-of-control' for a given level of
significance.

ii) The control limit lumps together all the parameters of the
context-tree, in contrast with traditional SPC where each process
parameter is controlled separately. Nevertheless, the KL statistic of the
tree can be easily decomposed to monitor separately each node in the
context-tree. This can be beneficial when looking for a cause of an
'out-of-control' signal.

iii) If Sd is large enough, the KL statistic is approximately
normally distributed. Hence, conventional SPC charts can be directly
applied to monitor the proposed statistic.

A basic condition for applying the KL statistic to sample data requires
that $P_0(x|s) > 0$, $\forall x \in X$, $\forall s \in \Gamma$. Such a constraint may be satisfied with the
predictive approach, i.e.,



where all probability values assigned to any of the symbol types are
strictly positive, in contrast to the non-predictive approach,
by defining $0/0 \equiv 0$. The choice among these alternative procedures
depends both on the knowledge regarding the system states and on the length of
the string used to construct the context-tree. However, in the latter
non-predictive case, the number of degrees of freedom is adapted according to the
number of categories that are not equal to zero, thus subtracting the
zero-probability categories when using the non-predictive approach.

Another example of a distance measure between two tree models is the
use of log-likelihood ratios. This example considers an application of pattern
recognition of E. coli promoters. The details of the experiment are as
follows.

The 238 DNA strings of size 12 from a given database were converted
to strings of numbers, encoded such that "A"=1; "C"=2; "G"=3; "T"=4. A
special version of the context-tree construction algorithm, "contl2", was
adapted for DNA sequences of size 12. It was adapted such that the tree
construction uses only contexts of length up to 5, and the context buffer
is reset every time it reaches the size eleven. Thus, effectively, each
appearance of a size 12 promoter updates the statistics of the context tree
and grows a tree with a depth of up to 5 levels (maximum context length of 5).
The tree construction parameters were not optimized, since the experiment was
conducted mainly for illustration purposes.

Reference is now made to Fig. 3b. The 238 number strings were
concatenated into one large string "s". A 5-level context tree was identified from
the 238 promoter DNA strings by using default tree parameters. The likelihood of
subsequences given the context trees can be computed for promoter and
non-promoter sequences. For example, the probability of the string GCTTA,
according to the context tree in Fig. 3b, is calculated by parsing the string into
identified contexts: P{GCTTA} = P(G) * P(C|G) * P(T|GC) * P(T|GCT) *
P(A|GCTT) = (see the context tree in Fig. 3b) = P(G) * P(C|G) * P(T) * P(T|T)
* P(A|CTT) = 0.4136 * 0.2164 * 0.3988 * 0.3589 * 0.5385.
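
The parsing of GCTTA into contexts can be sketched as follows. The dictionary `MODEL` is a hypothetical fragment that holds only the five probabilities quoted above; it is not the tree of Fig. 3b, whose remaining entries are not reproduced in this text. Contexts are keyed in reverse order (most recent symbol first), so the forward context CTT appears as "TTC", and the applicable context is assumed to be the longest stored key that prefixes the reversed history.

```python
import math

# Hypothetical fragment of a context tree: context (reversed) -> P(symbol | context).
MODEL = {
    "":    {"G": 0.4136, "T": 0.3988},   # root: no context retained
    "G":   {"C": 0.2164},                # P(C | G)
    "T":   {"T": 0.3589},                # P(T | T)
    "TTC": {"A": 0.5385},                # P(A | CTT), context written reversed
}


def context_for(history, model):
    """Longest stored context that prefixes the reversed history."""
    reversed_history = history[::-1]
    matches = [s for s in model if reversed_history.startswith(s)]
    return max(matches, key=len)          # "" always matches, so never empty


def sequence_likelihood(sequence, model):
    """Product of P(x_t | context of x_1..x_{t-1}) over the sequence."""
    likelihood = 1.0
    for t, symbol in enumerate(sequence):
        context = context_for(sequence[:t], model)
        likelihood *= model[context][symbol]
    return likelihood


p_gctta = sequence_likelihood("GCTTA", MODEL)
log_p_gctta = math.log(p_gctta)           # log-likelihood, as used in ratios
```

The product evaluates to roughly 0.0069. Computing the same quantity under a second tree and taking the difference of the logarithms would give a log-likelihood ratio of the kind mentioned above.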

As explained above, the distance measure between two models for a
given string is obtained by computing the likelihood of that string based on
the tree models. Thus, if (*) Pr{string | Tree No 2} > Pr{string | Tree No 1} >
Pr{string | Tree No 4} > ..., one can say that the string most likely
belongs to the group that is described by Tree No 2, and so on.

Reference is now made to Fig. 5A, which is a simplified flow chart
illustrating the control procedure used by the device of Fig. 2 to control a
process or the like. In Fig. 5A, a first stage 60 comprises obtaining a reference
context-tree, $P_0(x,s)$. This may be done analytically or by applying the
context algorithm to a long string of representative data, for example from a
training set, or the model may be obtained from an external source.

In a second stage 62, a data source is monitored by obtaining data from
the source. At succeeding points, a data sample is used to generate a current
data tree $\hat{P}_i(x,s)$ from a sample of sequenced observations of size N. The
sample size preferably complies with certain conditions that will be discussed
in detail hereinbelow. The sequences can be mutually exclusive, or they can
share some data points (often this is referred to as "sliding window"
monitoring). The order of the sequence can be reorganized or permuted in various
ways to comply with time-dependent constraints or other types of available
side-information. Each sequence used to generate a model is
referred to hereinbelow as a "run" and contributes a monitoring point in the
CSPC chart or a classification decision to the CTC. Following the MDI principle
referred to above, the structure of the current data tree is selected to correspond
to the structure of the reference context-tree. Once the structure of the tree has
been selected, then, in a model building stage 64, the counters of the current
data context tree are updated using values of the string, and probability
measures of the monitored context-tree are obtained, as will be explained in
greater detail below.

Once the model has been built, it may be compared with the
reference model, and thus the (log) likelihood ratio or the KL value can be
calculated to give a distance between the two models in a step 66. As
mentioned above, the log likelihood ratio is compared to a user-predefined
value. The KL value measures a relative distance between the monitored
distributions $\hat{P}_i(x,s)$ of the current model and the reference distributions
$P_0(x,s)$ as defined in the reference model. In some cases it might be valuable
to use several distance measures simultaneously and interpolate or average
their outcomes.

Referring to the CTC, the query step 68 indicates whether the obtained
statistic value points to one of the classes. When considering the CSPC, the KL
statistic value can be plotted on a process control chart against process control
limits in the query step 68. The control limits may, for example, comprise the
upper control limit (UCL) given in equation (3.6) above. If the KL value is
larger than the UCL, it indicates that a significant change may have occurred in
the process, and preferably an alarm is set.

The process now returns to step 62 to obtain a new run of data, and the
classifying process is repeated until the end of the process.

Considering Fig. 5A in greater detail, the data sample obtained at stage
62 may be considered as a sequence of observations $x^N = x_1, \ldots, x_N$, with
elements $x_t$, $t = 1, \ldots, N$, defined over a finite symbol set, X, of size d. In stage 64,
a primary output is a context tree for the sample, which context tree
contains optimal contexts and the conditional probabilities of symbols given
the optimal contexts. Namely, it is a model of the incoming data, incorporating
patterns in the incoming data and allowing probabilities to be calculated of a
likely next symbol given a current symbol.

Reference is now made to Fig. 5B, which is a simplified flow chart
showing how the same measurement may be carried out using stochastic
complexity. A reference tree is initially obtained in step 60A. Then a data
sample is obtained in step 62A. Stochastic complexity is calculated in step
64A and control limits are calculated in step 66A. Finally, the sample values 
are tested in step 68A to determine how to classify the sequence or whether the 
stochastic complexity is within the control limits. 

Reference is now made to Fig. 6A, which is a simplified flow chart
showing an algorithm for carrying out stage 64 in Fig. 5A, namely building of a
context tree model based on the sample gathered in step 62. More specifically,
Fig. 6A corresponds to stages 42 and 46 in Fig. 4. The tree growing
algorithm of Fig. 6A constructs the tree according to the following rules (the
algorithm depends on parameters that can be modified and optimized by the
user):

A stage S1 takes a single root as the initial tree, $T_0$, and all symbol
counts are set to zero. Likewise, a symbol counter t is set to 0.

A stage S2 reads the $(t+1)$-th symbol from the input sequence; thus, in
the first iteration, $x_1$ is read.

In stage S3 the algorithm begins a process of tracing back from the root
node to the deepest node in the tree. If $l = 0$ then tracing remains within the
same root; otherwise the algorithm chooses the branch of $T_t$ representing
symbol $x_{t-l+1}$. Stage S4 is part of the traceback process of step S3. As each
node in the tree is passed, the counter at that node, corresponding to the
current symbol, is incremented by one.

In step S5, the process determines whether it has reached a leaf, i.e., a
node with no descendant nodes. If so, the process continues with S6;
otherwise it returns to S3.

S6 controls the creation of new nodes. S6 checks that the last updated
counter is at least one and that further descendant nodes can be opened. It
will, for example, detect a counter set to zero in step S8. Preferably, the last
updated count is at least 1, $l < m$ (the maximum depth), and the traced-back
position $t - l$ is still within the string. Step S7
creates a new node corresponding to $x_{t-l+1}$. Step S8 generates one counter
with value 1 and the other counters with zero value at new node creation.
Those values may be detected by S6 when another symbol is read. Step S10
controls the retracing procedure needed to stimulate tree growth to its
maximal size, by testing $l < m$ and $t - l > 0$ and branching accordingly.

Once a leaf has been reached in S5 or S10, the traceback
procedure is complete for the current symbol. The maximal allowed depth of the
deepest node is set at an arbitrary limit (e.g., 5) to limit tree growth and size
and to save computations and memory space. Without such a limit there would be a
tendency to grow the tree to a point beyond which it is very likely to be
pruned in any case.

Stage S6 thus checks whether the last updated count is at least one
and whether the maximum depth has yet been reached. If the count is at least
one and the maximum depth has not been reached, then a new node is created in
step S7, to which symbol counts are assigned, all being set in step S8 to zero
except for that corresponding to the current symbol, which counter is set to 1.
The above procedure is preferably repeated until the maximum depth is reached
or the available past string has been exhausted in stage S10. Thereafter the
next symbol is considered in stage S2.

More specifically, having recursively constructed an initial tree $T_t$
from an initial symbol or string $x^t$, the algorithm moves ahead to consider
the next symbol $x_{t+1}$. Then tracing back is carried out along a path defined
by $x_t, x_{t-1}, \ldots$, and in each node visited along the path, the counter value of
the symbol $x_{t+1}$ is incremented by 1 until the tree's current deepest node, say
$x_t, \ldots, x_{t-l+1}$, is reached. Although not shown in Figs. 6A and 6B, an initial
string preceding $x^t$ may be used in order to account for initial conditions
(see Rissanen 1983, the contents of which are hereby incorporated by
reference).

If the last updated count is at least 1, and $l < m$, where m is the
maximum depth, the algorithm creates new nodes corresponding to
$x_{t-r}$, $l < r \le m$, as descendant nodes of the node defined in S6. Each new node is
assigned a full set of counters which are initialized to zero except for the one
counter corresponding to the current symbol $x_{t+1}$, which is set to 1.
Retracing is continued until the entire past symbol history of the current
input string has been mapped to a path for the current symbol $x_{t+1}$ or until m
is reached, r being the depth of the new deepest node, reached for the current
path, after completing stage S7.

Reference is now made to Fig. 6B, which is a simplified flow chart
showing a variation of the method of Fig. 6A. Steps that are the same as
those in Fig. 6A are given the same reference numerals and are not referred
to again except as necessary for understanding the present embodiment.

In Fig. 6B, the steps S9 and S10 are removed, and step S8 is followed
directly by step S11, thereby reducing the computational complexity. While
the previous algorithm is more accurate, in this algorithm the tree grows
slowly, at most one new node per symbol. Thus, in the beginning, when
the tree has not yet grown to its maximal depth, some counts are lost. If the
sequence length is much longer than the maximal tree depth, then
the difference in the counter values produced by the two algorithms will be
practically insignificant for the nodes left after the pruning process.

In order to understand better the algorithms of Fig. 6, reference is
now made to Figs. 7-9, which are diagrams of a model being constructed
using the algorithm of Fig. 6. Further illustrations are given in Tables A1
and A2.

In Figs. 7-9, tree 100 initially comprises a root node 102 and three
symbol nodes 104-108. Each one of nodes 102-108 has three counters, one for
each of the possible symbols "a", "b" and "c". The counters at the root node
give the numbers of appearances of the respective symbols and the counters at
the subsequent, or descendant, nodes represent the numbers of appearances of
the respective symbol following the symbol path represented by the node itself.
Thus the node 104 represents the symbol path "a". The second counter therein
represents the symbol "b". The counter being set to 1 means that in the
received string so far the number of "b"s following an "a" is 1. Node 106
represents the symbol path "b" and the first counter represents the symbol "a".
Thus the first counter being set to "1" means that in the received string so far
the symbol "a" has appeared once following a "b". The second counter being
on "0" implies that there are no "b"s followed by "b"s.

Node 108 represents context "b a" corresponding to the symbol path or
the sequence "a b" (recall that contexts are written in reverse order). The first
counter, representing "a", being set to "1" shows that there is one instance of the
sequence "a b" being followed by "a".

In Fig. 8, a fourth symbol $x_4 = b$ is received. The steps S3 to S10 of
Fig. 6 are now carried out. The symbol b, as preceded by "a b a" in that order,
can be traced back from node 104 to 102 (because the traceback covers the "b
a" suffix of the sequence). The "b" counters are incremented at each node
passed in the traceback. Likewise the sequence "a b a" can be traced back
from node 108 to the root, again incrementing the "b" counters each time.

In Fig. 9, a new node 110 is added after node 104, representing step S7
of Fig. 6. The node is assigned three counters as with all previous nodes. The
"b" counter thereof is set to 1 and all other counters are set to "0" as specified
in step S8 of Fig. 6. It is noted that the context for the new node is "a b", thus
representing the sequence "b a".
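
The counter updates walked through for Figs. 7-9 can be sketched in Python as below. This is a simplified illustration and not the flow chart itself: the function name and the dictionary layout (contexts keyed as reversed strings, the root as the empty string) are assumptions, several nodes may be opened per symbol in the manner described for Fig. 6A, and pruning is omitted.

```python
def grow_counter_tree(sequence, symbols, max_depth):
    """Grow a counter context-tree; keys are contexts written in reverse order."""
    tree = {"": {x: 0 for x in symbols}}           # root node, all counters zero
    for t, symbol in enumerate(sequence):
        tree[""][symbol] += 1                       # count the symbol at the root
        context = ""
        for depth in range(1, max_depth + 1):
            if t - depth < 0:                       # past string exhausted
                break
            context += sequence[t - depth]          # go one symbol further back
            if context not in tree:                 # open a new node, counters zero
                tree[context] = {x: 0 for x in symbols}
            tree[context][symbol] += 1              # count the symbol in this context
    return tree


# State after the first three symbols "a b a", then after the fourth symbol "b".
tree_after_three = grow_counter_tree("aba", "abc", max_depth=5)
tree_after_four = grow_counter_tree("abab", "abc", max_depth=5)
```

In this sketch `tree_after_three` holds the root and the contexts "a", "b" and "ba", and the fourth symbol raises the "b" counter of context "a" to 2 and opens the context "ab" with its "b" counter at 1.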

Returning now to the tree pruning stage 46 of Fig. 4, it is necessary to
prune the tree, as will be described below, to obtain what may be referred to as
the optimal contexts of $T_N$. Tree pruning is achieved by retaining the deepest
nodes w in the tree that practically satisfy equation 2.3 above. The following
two pruning rules apply (see Weinberger, Rissanen and Feder (1995) for further
details):

Pruning rule 1: the depth of node w, denoted by $|w|$, is bounded by
a logarithmic ratio between the length of the string and the number of
symbol types, i.e., $|w| \le \log(t+1)/\log(d)$; and

Pruning rule 2: the information obtained from the descendant
nodes, $sb$, $\forall b \in X$, compared to the information obtained from the parent
node s, is larger than a 'penalty' cost for growing the tree (i.e., of adding
a node).

The driving principle is to prune any descendant node having a
distribution of counter values similar to that of the parent node. In
particular, we calculate $\Delta_N(sb)$, a measure of the (ideal)
code-length difference of the descendant node $sb$, $\forall b \in X$,



and then require that $\Delta_N(w) \ge C(d+1)\log(t+1)$, wherein
logarithms are taken to base 2; and

C is a pruning constant tuned to process requirements (with
default C = 2 as suggested in Weinberger, Rissanen and Feder
(1995)).

The tree pruning process is extended to the root node with
condition $\Delta_N(x_0) = \infty$, which condition implies that the root node itself
cannot be pruned.
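
A hedged sketch of the second pruning rule follows. The expression for the code-length difference is not legible in this text, so the code assumes the form commonly associated with the Context algorithm, namely the sum over symbols of n(x|sb) times the base-2 logarithm of the ratio between the descendant's and the parent's empirical conditional probabilities. That formula, the function names and the example counters are all assumptions made for illustration.

```python
import math


def code_length_difference(child_counts, parent_counts):
    """Assumed form: sum_x n(x|sb) * log2( P(x|sb) / P(x|s) ).

    Probabilities are the empirical frequencies at each node. A child whose
    counter distribution equals the parent's yields a value of zero.
    """
    n_child = sum(child_counts.values())
    n_parent = sum(parent_counts.values())
    delta = 0.0
    for x, n in child_counts.items():
        if n > 0:  # zero counts contribute nothing
            delta += n * math.log2((n / n_child) / (parent_counts[x] / n_parent))
    return delta


def keep_descendant(child_counts, parent_counts, d, t, c=2.0):
    """Retain the node only if the gain reaches the penalty C(d+1)log2(t+1)."""
    penalty = c * (d + 1) * math.log2(t + 1)
    return code_length_difference(child_counts, parent_counts) >= penalty


# Hypothetical counters over d = 2 symbol types.
same_shape = code_length_difference({"a": 2, "b": 2}, {"a": 4, "b": 4})
strong_child = keep_descendant({"a": 100, "b": 0}, {"a": 100, "b": 100}, d=2, t=199)
weak_child = keep_descendant({"a": 4, "b": 0}, {"a": 4, "b": 4}, d=2, t=7)
```

The weak child is pruned even though its distribution differs from the parent's, because with so few observations the gain does not cover the penalty for adding a node.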

Reference is now made to Fig. 10, which is a simplified diagram
showing a pruned counter context-tree 112 constructed by applying the tree
building of Fig. 6 followed by tree pruning on a string containing 136 symbols,
namely eight replications of the substring (a,b,a,b,c,a,b,a,b,c,a,b,a,b,a,b,c). The
tree comprises a root node 114 and five further nodes 116 - 124. By contrast, the
unpruned tree, from which this was taken, may typically have had three
descendant nodes for each node.

Returning again to Fig. 4, stage 48, selection of optimal contexts, is
now described in greater detail. In stage 48, a set of optimal contexts, $\Gamma$,
containing the S shortest contexts satisfying equation 2.3 is specified. An
optimal context can be either a path to a leaf (a leaf being a node with no
descendants) or a partial leaf in the tree. A partial leaf is defined for an
incomplete tree as a node which is not a leaf, where for certain symbol(s) the
path defines an optimal context satisfying equation 2.3, while for other symbols
equation 2.3 is not satisfied and descendant node(s) are created. The set of
optimal contexts is specified by applying the following rule:



where the default value of a user-defined parameter is 0. This means
that $\Gamma$ contains only those contexts that are not part of longer contexts. When
the inequality in expression 4.2 turns into equality, that context is fully
contained in a longer context and, thus, is not included in $\Gamma$. It is noted that in
each level in the tree there is a context that does not belong to a longer context
and, therefore, does not satisfy equation 4.2. This is generally due to initializing
conditions. Such inconsistency can be solved by introducing an initiating
symbol string as suggested in Weinberger, Rissanen and Feder (1995).

In summary, $\Gamma$ contains all the leaves in the tree as well as partial
leaves satisfying equation 4.2 for certain symbols.
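
The selection of optimal contexts can be illustrated with the following sketch. Since the selection rule itself is not legible in this text, the code assumes the reading suggested by the surrounding description: a context is optimal when the symbols counted at its node exceed, by more than a threshold, those already accounted for by its retained descendants. The counter values are illustrative only; they were derived by hand from the 136-symbol periodic string with boundary effects ignored, and they should not be read as the actual content of Fig. 10.

```python
# Illustrative pruned counter tree: context (reversed) -> counters n(x | context).
COUNTS = {
    "":    {"a": 56, "b": 56, "c": 24},
    "a":   {"a": 0,  "b": 56, "c": 0},
    "b":   {"a": 32, "b": 0,  "c": 24},
    "c":   {"a": 24, "b": 0,  "c": 0},
    "ba":  {"a": 32, "b": 0,  "c": 24},
    "bac": {"a": 24, "b": 0,  "c": 0},
}
SYMBOLS = "abc"


def net_count(tree, context, symbols):
    """Symbols counted at `context` that are not passed on to a retained child."""
    own = sum(tree[context].values())
    in_children = sum(sum(tree[context + b].values())
                      for b in symbols if context + b in tree)
    return own - in_children


def select_optimal_contexts(tree, symbols, threshold=0):
    """Assumed rule: keep contexts whose net count exceeds the threshold."""
    return {s for s in tree if net_count(tree, s, symbols) > threshold}


gamma = select_optimal_contexts(COUNTS, SYMBOLS)
```

With these counters the root and the context "b" are fully contained in longer contexts and drop out, while "ba" is retained as a partial leaf alongside the leaves "a", "c" and "bac".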



(4.2)



wo 02/067075 PCT/IL02/00131 

Returning to Fig. 10, and node 122 corresponds to string history or 
context 5=6a and is, as defined above, a partial leaf (acts as an optimal context) 
for symbols a and c. This is firstly because the longer context s—bac does not 
include all symbol occurrences of symbols a and c. Secondly, the contexts bab 

5 and baa were pruned and Iximped into the context ba. Applying equation 4.2 to 
the pruned counter context-tree presented in figure 10 results in four optimal 
contexts r= {a; bac; c; ba}. The first three contexts in T are leaves, the latter 
is a partial leaf and defines an optimal context for symbols a and c. Considering 
now in greater detail stage 50, estimation of parameters in Fig. 4, the estimation 

10 stage is composed of three steps as follows: 

1) the probabilities of optimal contexts are estimated and denoted by 

2) the conditional probabilities of symbols given the optimal contexts 
are estimated and denoted by P(x \s), xsX, s^F; and 

15 3) the estimated joint probabilities of symbols and optimal contexts are 

calculated P(jc,s), x e X, s^F. 

Given the set of optimal contexts and the pruned counter tree, the probabilities of the optimal contexts in the tree, $\hat{P}(s)$, $s \in \Gamma$, are estimated by their frequency in the string:

$$\hat{P}(s) = \frac{n(s)}{\sum_{s' \in \Gamma} n(s')}, \qquad s \in \Gamma,$$

where $n(s)$ is the sum of the symbol counters in the corresponding leaves (or partial leaves) that belong to the optimal context $s$ and not to a longer context $sb$, $b \in X$. Each symbol in the string thus belongs to one out of $S$ disjoint optimal contexts, each of which contains $n(s)$ symbols. An allocation of symbols of a sufficiently long string to distinctive optimal contexts can be approximated by the multinomial distribution.

Returning to Fig. 10, the estimated probabilities of the optimal contexts in figure 6 are given respectively by

$$\{\hat{P}(a), \hat{P}(bac), \hat{P}(c), \hat{P}(ba)\} = \{56/136,\ 24/136,\ 24/136,\ 32/136\}.$$

Once the symbols in the string are partitioned into $S$ substrings of optimal contexts, the conditional probabilities of symbol types given an optimal context are estimated by their frequencies in the respective substring,

$$\hat{P}(x \mid s) = \frac{n(x \mid s) - \sum_{b \in X} n(x \mid sb)}{n(s)}, \qquad \forall x, b \in X,\ s \in \Gamma, \qquad (4.3)$$

where $0/0 \equiv 0$. The distribution of symbol types in a given optimal context is thus approximated by another multinomial distribution, whose sample size is

$$n(s) = \sum_{x \in X} \Bigl( n(x \mid s) - \sum_{b \in X} n(x \mid sb) \Bigr).$$

An alternative predictive approach to equation 4.3 was suggested in Weinberger, Rissanen and Feder (1995). It is implemented in cases where one needs to assign strictly positive probability values to outcomes that may not actually have appeared in the sample string, yet can occur in reality; in such cases the predictive estimate of equation 4.4 is used in place of equation 4.3.

The choice between the two alternative procedures above depends both on the knowledge regarding the system states and on the length of the string used to construct the context-tree. The latter approach is suitable for applications involving forecasting.
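The practical difference between the two estimators can be illustrated with a small Python sketch. The non-predictive function is a plain relative-frequency estimate in the spirit of equation 4.3. The predictive function uses add-1/2 smoothing, which is an assumption made here for illustration only and is not necessarily the form proposed in Weinberger, Rissanen and Feder (1995). The function names are hypothetical, and the counts are the net counts of symbols (a, b, c) that correspond to the context ba in the running example.

```python
from fractions import Fraction

def non_predictive(counts):
    """Relative-frequency estimate; unseen symbols get probability zero."""
    total = sum(counts)
    if total == 0:
        return [Fraction(0)] * len(counts)   # no observations: return zeros
    return [Fraction(c, total) for c in counts]

def predictive(counts, alpha=Fraction(1, 2)):
    """Smoothed estimate (assumed add-alpha form); every symbol gets a positive probability."""
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]

# Net counts of symbols (a, b, c) in one optimal context.
counts = [8, 0, 24]
p_np = non_predictive(counts)
p_pr = predictive(counts)
print(p_np)
print(p_pr)
```

With the non-predictive estimate the symbol b, never seen in this context, receives probability zero; with the smoothed estimate it receives a small positive probability, which is what a forecasting application would typically require.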

Reference is now made to Fig. 11, which is a simplified tree diagram showing a tree 130 having a root node 132 and five other nodes 134-140. The counters now contain probabilities, in this case the estimated conditional probabilities of symbols given contexts. The probability estimates are generated by applying equation 4.3 to the counter context-tree 112 of Fig. 11. For example, the conditional probability of a symbol type $x \in \{a, b, c\}$ given the context $s = a$ is estimated as $\hat{P}(x \mid a) = (0, 56/56, 0) = (0, 1, 0)$, whereas the conditional probability of a symbol $x \in \{a, b, c\}$ given the context $s = ba$ is estimated as $\hat{P}(x \mid ba) = (8/32, 0, 24/32) = (0.25, 0, 0.75)$. The probabilities of symbols in non-optimal contexts are also shown for general information.
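The estimation steps can be traced numerically. The following Python sketch applies equation 4.3 to counters for the four optimal contexts of the running example. The raw counters are assumptions chosen to be consistent with the probabilities quoted in the text, since the full counter tree of Fig. 10 is not reproduced here, and the names `counters`, `children` and `net_counts` are hypothetical.

```python
from fractions import Fraction

# Assumed raw counters n(x|s), ordered as (a, b, c), for the optimal contexts.
counters = {
    "a":   (0, 56, 0),
    "bac": (24, 0, 0),
    "c":   (24, 0, 0),
    "ba":  (32, 0, 24),
}
# Longer contexts sb that remain in the pruned tree below each context s.
children = {"a": [], "bac": [], "c": [], "ba": ["bac"]}

def net_counts(s):
    """n(x|s) minus the counts already claimed by longer contexts sb."""
    net = list(counters[s])
    for child in children[s]:
        net = [n - m for n, m in zip(net, counters[child])]
    return net

n = {s: sum(net_counts(s)) for s in counters}            # n(s)
total = sum(n.values())                                   # string length
p_context = {s: Fraction(n[s], total) for s in n}         # P(s)
p_symbol = {                                              # P(x|s), equation 4.3
    s: [Fraction(c, n[s]) if n[s] else Fraction(0) for c in net_counts(s)]
    for s in counters
}

print(n, total)
print(p_symbol["a"], p_symbol["ba"])
```

The partial leaf ba keeps only the counts that are not passed on to the longer context bac, which is why its net counts differ from its raw counters.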

As shown in the figure, the optimal contexts are {132, 140, 136, 138}.

Returning now to Fig. 3a, tree 30 is the context tree generated from a string of length $N = 1088$ (64 replications of the basic string a,b,a,b,c,a,b,a,b,c,a,b,a,b,a,b,c). The maximum tree depth has increased from three to five levels, as opposed to the preceding examples. Nevertheless, the number of optimal contexts acting as states has only increased from four to five. It is pointed out that had a Markov chain model of order $k = 5$ been used, it would have been necessary to estimate transition probabilities between $3^5 = 243$ states, most of which are redundant in any case.

The joint probabilities of symbols and optimal contexts that represent the context-tree in its final form are evaluated thus:

$$\hat{P}(x, s) = \hat{P}(x \mid s) \cdot \hat{P}(s), \qquad x \in X,\ s \in \Gamma.$$
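As a worked illustration of the product rule, the Python sketch below builds the joint table for the four-context example. The context probabilities and the conditional rows for the contexts a and ba are the values quoted in the text; the rows for bac and c are assumptions added only so that the table is complete.

```python
from fractions import Fraction as F

# Context probabilities quoted in the text for the example of Fig. 10.
p_context = {"a": F(56, 136), "bac": F(24, 136), "c": F(24, 136), "ba": F(32, 136)}

# Conditional probabilities over the symbols (a, b, c).
p_symbol = {
    "a":   (F(0), F(1), F(0)),        # quoted in the text
    "ba":  (F(1, 4), F(0), F(3, 4)),  # quoted in the text
    "bac": (F(1), F(0), F(0)),        # assumption for illustration
    "c":   (F(1), F(0), F(0)),        # assumption for illustration
}

# Joint probability P(x, s) = P(x|s) * P(s) for every symbol and context.
joint = {
    (x, s): p_symbol[s][i] * p_context[s]
    for s in p_context
    for i, x in enumerate("abc")
}

print(joint[("c", "ba")])
print(sum(joint.values()))
```

Because the optimal contexts are disjoint and every conditional row sums to one, the joint table sums to one as well, which is a convenient sanity check on an estimated model.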

As explained above with respect to Fig. 5, the model as built in accordance with the procedures outlined in Figs. 6-11 can be used in the comparison stage of Fig. 5 to obtain information about the comparative statistical properties of data sequences.

The above procedure is summarized and illustrated in the following two tables for the string $x^6 = 4,4,4,3,3,2$, where $X = \{0,1,2,3,4\}$. Counters are listed in the order of the symbols 0 to 4.

Step 0: $T_0$
Tree:
    root: (0,0,0,0,0)
Description: Initialization: the root node, $\lambda$, denotes the empty context.

Step 1: $T_1$, $x^1 = 4$
Tree:
    root: (0,0,0,0,1), n(x=4|root) = 1
Description: The only context for the first symbol is $\lambda$; the counter $n(x=4 \mid \lambda)$ was incremented by one.

Step 2: $T_2$, $x^2 = 4,4$
Tree:
    root: (0,0,0,0,2), n(x=4|root) = 2
    s=4: (0,0,0,0,1), n(x=4|s=4) = 1
Description: The counters $n(x=4 \mid \lambda)$ and $n(x=4 \mid s=4)$ are incremented by one. The node of the context s=4 is added to accumulate the counts of symbols given the context s=4.

Step 3: $T_3$, $x^3 = 4,4,4$
Tree:
    root: (0,0,0,0,3), n(x=4|root) = 3
    s=4: (0,0,0,0,2), n(x=4|s=4) = 2
    s=44: (0,0,0,0,1), n(x=4|s=44) = 1
Description: The counter of the symbol 4 is incremented by one in the nodes from the root to the deepest node along the path defined by the past observations. In this case, the counters $n(x=4 \mid \lambda)$ and $n(x=4 \mid s=4)$ are incremented by one. A new node is added for the context s=44.

Step 4: $T_4$, $x^4 = 4,4,4,3$
Tree:
    root: (0,0,0,1,3), n(x=3|root) = 1
    s=4: (0,0,0,1,2), n(x=3|s=4) = 1
    s=44: (0,0,0,1,1), n(x=3|s=44) = 1
    s=444: (0,0,0,1,0), n(x=3|s=444) = 1
Description: The counters $n(x=3 \mid \lambda)$, $n(x=3 \mid s=4)$ and $n(x=3 \mid s=44)$ are incremented by one since the past contexts are $s = \lambda$, s=4, s=44. A new node is added for the context s=444 of the observation $x_4 = 3$.

Step 5: $T_5$, $x^5 = \ldots,4,3,3$
Tree:
    root: (0,0,0,2,3), n(x=3|root) = 2
    s=3: (0,0,0,1,0), n(x=3|s=3) = 1
    s=4: (0,0,0,1,2), n(x=3|s=4) = 1
    s=34: (0,0,0,1,0), n(x=3|s=34) = 1
    s=44: (0,0,0,1,1), n(x=3|s=44) = 1
    s=444: (0,0,0,1,0), n(x=3|s=444) = 1
Description: Add new nodes for the contexts s=3, s=34, s=344, and so on. Update the counter of the symbol x=3 from the root to the deepest node on the path of past observations.

Step 6: $T_6$, $x^6 = \ldots,3,3,2$
Tree:
    root: (0,0,1,2,3), n(x=2|root) = 1
    s=3: (0,0,1,1,0)
    s=4: (0,0,0,1,2)
    s=33: (0,0,1,0,0), n(x=2|s=33) = 1
    s=34: (0,0,0,1,0), n(x=3|s=34) = 1
    s=44: (0,0,0,1,1)
    s=444: (0,0,0,1,0), n(x=3|s=444) = 1
Description: Update the counter of the symbol x=2 from the root to the deepest node on the path of past observations. Add the contexts s=33 and so on.

Table A1: Tree growing and counter updating stage in context algorithm for string $x^6 = 4,4,4,3,3,2$
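The counter updates of Table A1 can be reproduced with a short Python sketch. As a simplifying assumption, the sketch counts every context up to a fixed maximum depth in a single pass rather than growing the tree node by node, so it may hold a few more deep nodes than the table shows; the root and level-one counters agree with the table. Contexts are written most recent symbol first, and the function name `count_contexts` is hypothetical.

```python
from collections import defaultdict

def count_contexts(string, alphabet_size, max_depth):
    """Return {context: counters}, where counters[x] = n(x | context).

    A context is a tuple of past symbols, most recent first; () is the root.
    """
    tree = defaultdict(lambda: [0] * alphabet_size)
    for t, x in enumerate(string):
        past = tuple(reversed(string[:t]))       # most recent symbol first
        for depth in range(min(len(past), max_depth) + 1):
            tree[past[:depth]][x] += 1           # update root .. deepest context
    return dict(tree)

tree = count_contexts([4, 4, 4, 3, 3, 2], alphabet_size=5, max_depth=3)
print(tree[()])       # root counters
print(tree[(3,)])     # counters of the context s=3
print(tree[(4,)])     # counters of the context s=4
```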



Rule 1
Tree:
    root: (0,0,1,2,3)
    s=3: (0,0,1,1,0)
    s=4: (0,0,0,1,2)
Description: Maximum tree depth $\le \log(t)/\log(d) = \log(6)/\log(5) = 1.11$. The maximum tree depth is of level one. Thus, all nodes of level 2 and below are trimmed.

Rule 2
Tree:
    root: (0,0,1,2,3)
Description: For the rest of the nodes in level one and the root, we apply trimming rule 2. The threshold for $C = 2$ is

$$\Delta_6(u) > 2(d+1)\log(t+1) = 33.7,$$

and for each of the nodes:

$$\Delta_6(sb = \lambda 3) = 0 + 0 + 1 \cdot \log\frac{1/2}{1/6} + 1 \cdot \log\frac{1/2}{2/6} + 0 = 2.17,$$

$$\Delta_6(sb = \lambda 4) = 0 + 0 + 0 + 1 \cdot \log\frac{1/3}{2/6} + 2 \cdot \log\frac{2/3}{3/6} = 0.83.$$

The code-length difference is below the threshold, hence the first level nodes are trimmed.

Table A2: Pruning stage in context algorithm for the string $x^6 = 4,4,4,3,3,2$
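The quantities in Table A2 can be checked numerically. The Python sketch below assumes base-2 logarithms for the threshold and for the code-length differences, which reproduces the values quoted in the table, and takes the code-length difference of a node to be the sum over symbols of n(x|sb) log(P(x|sb)/P(x|s)). The function and variable names are hypothetical.

```python
import math

def code_length_difference(child, parent):
    """Sum over symbols of n(x|sb) * log2( P(x|sb) / P(x|s) )."""
    n_child, n_parent = sum(child), sum(parent)
    delta = 0.0
    for c, p in zip(child, parent):
        if c > 0:                                 # zero counts contribute nothing
            delta += c * math.log2((c / n_child) / (p / n_parent))
    return delta

t, d, C = 6, 5, 2                                 # string length, alphabet size, pruning constant
root = [0, 0, 1, 2, 3]
nodes = {"3": [0, 0, 1, 1, 0], "4": [0, 0, 0, 1, 2]}

max_depth = math.log(t) / math.log(d)             # Rule 1 bound
threshold = C * (d + 1) * math.log2(t + 1)        # Rule 2 threshold
deltas = {s: code_length_difference(n, root) for s, n in nodes.items()}
kept = [s for s, dl in deltas.items() if dl > threshold]

print(round(max_depth, 2), round(threshold, 1))
print({s: round(dl, 2) for s, dl in deltas.items()}, kept)
```

Neither level-one node exceeds the threshold, so the list of kept nodes is empty and only the root survives, in agreement with the table.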



Tuning of the Model for Specific Conditions 

The use of context trees as modelers for analysis and pattern recognition, in particular for biological data sets, may be modified in various ways to improve performance. It is noted that the various parameters of the context tree model can be used to optimize performance for different pattern recognition problems, by using side information specific to the problem, and by using empirical and numerical tests. The flexibility of the modelers is too large to be indicated fully here. Nevertheless, in the following list several optional modifications of the model parameters are indicated for specific experimental conditions:

Combination of several separate context tree models: several fixed string length models are tuned together. That is to say, each model is tuned on a different part of a long input string, through manipulation of the context tree algorithm with limited size buffers and with jumping over irrelevant string segments. Then, instead of using a given string length, the string is divided into segments and each segment is handled by a different context tree model. For example, in the E. coli promoter-recognition problem, it is known that most promoters have two different recognition sites, each having six base-pairs (bp) and separated by an intermediate segment of variable length. Therefore, one can modify the input string length for training and identification.

For a model based on a string length of 12 bp (and without implementing other optimization procedures, such as non-homogeneous trees), a context tree was experimentally identified that yielded a True Positives value of 69.75% and a True Negatives value of 67.94% using a cross-validation set. Then, by using the same non-optimized model based on two separate context trees, each of which was for input strings of 6 bp, the results improved to a True Positives value of 71.43% and a True Negatives value of 72.01% for the cross-validation set. Optimizing the other parameters of the tree model enabled a dramatic improvement, yielding a True Positives value of 92.86% and a True Negatives value of 91.5%. The same idea may be extended to using a model which is based on four different context trees for input strings of 3 bp each. Such partitions increase the position-dependent sensitivity of the model; however, they lead to some loss of information regarding the dependence of border regions between the partial sequences. An optimization procedure may thus be performed for finding the optimal split in terms of the recognition efficiency.

Sliding Windows: the division of an input string into sub-strings can be performed either to obtain mutually exclusive sub-strings, that do not share common sequences, or in the manner of a sliding-window, that is to say forming substrings from incremental lengths along the sequence. For example, in the problem of identification of DNA coding regions, at each step an input string of a fixed length contains several new bp in comparison to the input string of the last step. In this manner, the model is gradually modified and the classification can be indicated in earlier stages.
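The two ways of dividing an input string can be sketched in Python as follows. The function names, the window width, the step and the example string are all hypothetical and chosen only for illustration.

```python
def exclusive_windows(seq, width):
    """Non-overlapping substrings of the given width (a short tail is dropped)."""
    return [seq[i:i + width] for i in range(0, len(seq) - width + 1, width)]

def sliding_windows(seq, width, step):
    """Overlapping substrings; each one brings `step` new symbols."""
    return [seq[i:i + width] for i in range(0, len(seq) - width + 1, step)]

dna = "ATGGCGTACGTT"                       # made-up 12 bp string
print(exclusive_windows(dna, 6))
print(sliding_windows(dna, 6, 3))
```

With a step smaller than the width, consecutive windows overlap, so each new input string shares most of its symbols with the previous one and a classification can be updated after only a few new base-pairs.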

Non-Homogeneous models: in non-homogeneous models, different alternating context tree models are interlaced in a predetermined sequence that can capture cyclic phenomena in the string. For example, in the coding/non-coding classification problem, it is known that coding DNA segments have an internal structure of triplets (or codons) and the position in the triplet is important for correct classification. It is possible to grow three different context trees, one for each of the relative positions of the bases in the codon, and when handling large strings of coding, or suspected coding, DNA, the predictions of the three trees are interlaced together to generate the prediction for the DNA segment.
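The interlacing idea can be sketched in Python by keeping one set of conditional counters per position within the codon. As a simplifying assumption, each position-specific model below is a fixed-order table of context counts rather than a pruned context tree; the function name `train_phase_models` and the training strings are hypothetical.

```python
from collections import Counter, defaultdict

def train_phase_models(sequences, period=3, order=1):
    """One table of context -> next-symbol counts per position within the period."""
    models = [defaultdict(Counter) for _ in range(period)]
    for seq in sequences:
        for t in range(order, len(seq)):
            context = seq[t - order:t]
            models[t % period][context][seq[t]] += 1   # model chosen by codon position
    return models

training = ["ATGGCCATG", "ATGGCAATG"]       # made-up coding-like strings
models = train_phase_models(training)
# Counts of the symbol that follows "T" at the third codon position.
print(dict(models[2]["T"]))
```

When scoring a new segment, the model used for each symbol would be selected by the same position index, so the three sets of predictions are interlaced along the string.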

To illustrate the above, an experiment was carried out using the benchmark dataset of Fickett and Tung (1992) for accuracy of discrimination between coding and non-coding DNA segments of length 54/108/162 bp respectively. Accuracies of 85.0/90.5/92.6 respectively were obtained using the non-homogeneous method, compared to accuracies of 74.3/77.5/80.5 using only one context tree. The non-homogeneous results are better than the previously best results on the dataset using other non-homogeneous methods: 80.7/84.9/88.0. That is to say, the present method provided an approximate 4% improvement.

Predictive vs. Non-Predictive models: predictive models assign non-zero probabilities to events that were never observed in the training set, while non-predictive models assign zero probability to events that were never observed in the training set. The choice of the appropriate model is based on the feasibility of the unobserved events: if unobserved events are not feasible, then the non-predictive formula is more accurate. Once again, the use of predictive vs. non-predictive models can be checked against cross-validation properties.

Selection of the tree truncation threshold: the threshold is one 
of the most important default parameters and directly determines the 
size of the final model. Indirectly it determines the computational 
aspects of the algorithm, and the accuracy of the model. It may be 
optimized for each application. For example, in predictive 
applications such as time series forecasting it was empirically found 
that a smaller than default threshold improves the quality of the 
prediction. 

Tree Construction Parameters: the tree construction parameters proposed in the algorithm are default parameters for optimization. Such parameters include: i) the tree maximum depth; ii) the algorithm buffer size; iii) the values of the pruning constants; iv) the length of input sequences; v) the order of input symbols; vi) the number of nodes to grow after a leaf is reached; vii) other parameters indicated in Figure 6A, etc. Further improvement of the model is possible by optimization of the default parameters in accordance with the conditions of each specific application.

In general, the Markov model does not take into account the possibility of position dependence in the model, nor the possibility of varying order and partial leaves. The stochastic models discussed herein do take such position information into account.

Applications 

Use of the above context model to monitor changes in state has a wide range of applications. In general, the model part may be applied to any process which can be described by different arrangements of a finite group of symbols and wherein the relationship between the symbols can be given a statistical expression, although, in the present disclosure, emphasis is placed on spatially related data. For example, sequences of nucleotides belonging to an alphabet symbol set of four letters, or a sequence of amino acids belonging to an alphabet set of 20 letters, or a sequence of protein structures described by an alphabet set of letters for primary, secondary and 3D structure, all make good candidates for consideration. The comparison stage as described above allows for changes in the statistics of the symbol relationships to be monitored, and thus modeling plus comparison may be applicable to any such process in which dynamic changes in those statistics are meaningful.

In the field of biology an important application concerns the recognition and prediction of common function amongst biological sequences, for example within promoters and coding sequences from amongst DNA sequences, or in the recognition or prediction of three-dimensional shape and function in proteins. DNA sequences are sequences of four bases termed Adenine, A, Cytosine, C, Guanine, G, and Thymine, T, and combinations of the four bases provide regulatory sequences, encoding sequences for amino-acids to be used in proteins, intron sequences which are transcribed yet not translated, and non-coding sequences of yet unidentified function, including high, moderate and low repetitive sequences. Sequences of a given type have a certain similarity in the statistical distribution of the bases; however, such similarity is not at all trivial to detect.

In general, any kind of biological sequence can be analyzed in the above way to identify or predict properties. Provided groups of sequences can be identified that share a common functionality or structure, they may be used as a training data base to construct a tree model and later identify whether a new sequence belongs to that group or has a similar feature or structure that might imply a certain relation to that group.

Thus, for example, promoters that form a certain group of DNA sequences that share a common functionality can be used as a training data base to construct a tree and later identify whether a new sequence belongs to that group.

Numerical Example

By way of example, an experimental attempt to identify promoter regions in E. coli DNA was described in Fig. 3b. An input string of observations is introduced to a feature transformation module along with a context-tree model. The tree model is constructed by the methods described above, based on a training set generated by concatenating several 12-base-pair sequences of known promoters. The context-tree model represents the probabilistic structure in the training set but with a huge reduction of dimensionality. Instead of representing all conditional statistics that emerged from the training set, the tree represents only those statistics that were found to be significant. The transformation computes the log-likelihood of the input string being a "promoter" or a "non-promoter" according to the tree structure (i.e., the feature vector, y). The log-likelihood values are then introduced to the classifier, which decides, based on a given threshold, whether to classify this input string as a promoter or not.
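The transformation and thresholding steps can be sketched in Python. As simplifying assumptions, each model below is a first-order table of conditional probabilities rather than a context tree, the feature is the difference between the two log-likelihoods, and all probabilities are made up; the names `log_likelihood` and `classify` are hypothetical.

```python
import math

def log_likelihood(seq, model, order=1):
    """Sum of log P(x_t | preceding `order` symbols) under a conditional model."""
    total = 0.0
    for t in range(order, len(seq)):
        total += math.log(model[seq[t - order:t]][seq[t]])
    return total

def classify(seq, promoter_model, background_model, threshold=0.0):
    """Label the string a promoter when the log-likelihood difference exceeds the threshold."""
    score = log_likelihood(seq, promoter_model) - log_likelihood(seq, background_model)
    return ("promoter" if score > threshold else "non-promoter"), score

# Made-up conditional probabilities P(next | previous), for illustration only.
uniform = {b: {x: 0.25 for x in "ACGT"} for b in "ACGT"}
at_rich = {b: {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1} for b in "ACGT"}

print(classify("TATAAT", at_rich, uniform))
print(classify("GCGCGC", at_rich, uniform))
```

Raising or lowering the threshold trades correct acceptances against correct rejections, which is how a classifier of this kind would be tuned on a validation set.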

More specifically, in the experiment, 238 DNA sequences taken from E. coli promoters (two six-mers) were coded by a numerical alphabet (A=1, C=2, G=3, T=4) and arranged into a long source string S. A context tree modeler was set to build a tree from the data using a context length, and deepest node, of 5; that is to say, at each stage of building the tree only the five previous symbols were retained, and a buffer having symbol length 12 was set. Preferably, the buffer length serves to reset the process of growing the context tree.

In the experiment a training set of 238 E. coli promoter sequences was available. In a first experiment, performed without cross-validation, the entire training set was used to build a tree model and then the model was used to specify the likelihood for each promoter, i.e., the promoter itself was also part of the training set. In the second experiment 238 different training sets were used to calculate the likelihood of each promoter, each training set containing 237 promoters, i.e., without the promoter itself.

In the cross-validation experiment, reference context trees were built in each case for sequences of 237 promoters and then an attempt was made to identify the 238th promoter from the source string S.

The above experiment is referred to as a cross-validation experiment, and its purpose is to test the quality of a model built with 237 promoters on promoter No. 238, which did not take part in the tree growing process.
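The leave-one-out structure of the cross-validation experiment can be sketched in Python. The `train` and `score` functions below are toy stand-ins introduced only to make the loop runnable; in the experiment they would correspond to growing a context tree and computing a likelihood, and all names and data here are hypothetical.

```python
def leave_one_out(sequences, train, score):
    """Score each sequence with a model trained on all the other sequences."""
    results = []
    for i, held_out in enumerate(sequences):
        training_set = sequences[:i] + sequences[i + 1:]   # exclude the held-out item
        model = train(training_set)
        results.append(score(held_out, model))
    return results

# Toy stand-ins: the "model" is the set of symbols seen in training, and the
# score is the fraction of the held-out string's symbols that the model has seen.
def train(seqs):
    return set("".join(seqs))

def score(seq, model):
    return sum(ch in model for ch in seq) / len(seq)

data = ["TATAAT", "TTGACA", "TATGAT"]
print(leave_one_out(data, train, score))
```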

The result achieved was that the third context tree in each case, namely that built from a concatenation of three replicas of the string S, gave correct identification of the 238th string in 75% of cases without optimizing the tree parameters at all, thus providing an illustrative example. In a later experiment, by optimizing certain tree parameters, the accuracy level was raised to 94%.

The first experiment was repeated with position dependence and with sub-trees, and improvements were obtained. Thus, in the sub-trees experiment each promoter was represented by two separate 6 base-pair trees instead of one 12 base-pair tree. The reason is that the E. coli promoter is composed of two 6 bp sites that are separated by several nucleotides. When calculating the likelihood of a promoter, the first 6 bp was calculated using the first tree model and the second 6 bp using the second model.

Stochastic models were built in the way described above in order to distinguish between coding and non-coding DNA sequences. The models demonstrated substantial species independence, although more specific species-dependent models may provide greater accuracy. The experiments concerned the construction of a coding model and a non-coding model, each using 200 DNA strings divided into test sets and validation sets respectively.

Non-homogeneous trees that were applied to DNA segments of length 162 bp with a zero threshold yielded 94.8 percent correct rejections (True Negatives) and 93 percent correct acceptances (True Positives). Using another model with a different threshold, 99.5% correct rejections were obtained for the coding model, with 21% false rejections. Using the same model for the non-coding model, the percentage of correct rejections was 100% and the percentage of false rejections was 12%.

It was noted that the coding model had a much smaller context tree than 
the non-coding model. 

Other DNA applications include the ability to distinguish, analyze and classify: i) exons and introns; ii) Splice Sites; iii) Terminators; iv) Poly A; v) Proteins; vi) other biological sequences that share functionality features. There is also provided the possibility of predicting the structure of a protein.

Medical applications for the above embodiments are numerous. Any signal representing a body function can be discretized to provide a finite set of symbols. The symbols appear in sequences which can be modeled, and changes in the sequence can be indicated by the comparison step referred to above. Thus, medical personnel are provided with a system that can monitor a selected bodily function and which is able to provide an alert only when a change occurs. The method is believed to be more sensitive than existing monitoring methods.
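The discretization step can be sketched in Python. The cut points, the symbol alphabet and the sample values are all made up for illustration, and the function name `discretize` is hypothetical; a real application would choose the intervals to suit the signal being monitored.

```python
import bisect

def discretize(signal, thresholds, symbols):
    """Map each sample to the symbol of the interval it falls in.

    `thresholds` must be sorted, and len(symbols) == len(thresholds) + 1.
    """
    return [symbols[bisect.bisect_right(thresholds, v)] for v in signal]

# Made-up heart-rate samples (beats per minute) and cut points.
heart_rate = [58, 64, 72, 95, 110, 88]
sequence = "".join(discretize(heart_rate, thresholds=[60, 100], symbols="LNH"))
print(sequence)
```

The resulting symbol string is the kind of finite-alphabet sequence to which the modeling and comparison stages can then be applied.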

A further application of the modeling and comparison process is image processing and comparison. A stochastic model of the kind described above can be built to describe an image, for example a medical image. The stochastic model may then be used to compare other images in the same way that it compares other data sequences. Such a system is useful in automatic screening of medical image data to identify features of interest. The system can be used to compare images of the same patient taken at different times, for example to monitor progress of a tumor. Alternatively, it could be used to compare images taken from various patients already diagnosed with a given condition against an image showing the same view of a patient being tested.

A further application of the modeling and comparison process is in pharmaceutical research. A protein (e.g., an enzyme, receptor, ligand, etc.) carrying out a particular function is required. A series of proteins which all carry out the required function are sequenced or have a known sequence, and a model derived from all the sequences together may define the required sequence for the desired protein. At a higher level of organization, a protein (e.g., an enzyme, receptor, ligand, etc.) carrying out a particular function is required. A series of proteins which all carry out the required function are structurally analyzed or have a known structural analysis (obtained, for example, by X-ray crystallography), and a model derived from all the sequences together may define the required structure for the desired protein.

A further application of the modeling and comparison process is to analyze nucleic acid microarrays, such as DNA chips, to determine, for example, the nature (e.g., the base sequence) of the materials to which they have been exposed (e.g., nucleic acids of unknown or undetermined sequence).

A further application of the modeling and comparison process is to analyze protein (e.g., antigen or antibody) microarrays, such as protein chips, to determine, for example, the nature of the materials to which they have been exposed (e.g., antibodies or antigens, respectively).

A further application of the modeling and comparison process as described above is in forecasting. For example, such a forecast can be applied to natural data, such as weather conditions, or to financial data, such as stock markets. As the model expresses a statistical distribution of the sequence, it is able to give a probability forecast for a next expected symbol given a particular received sequence.

A further application of the present embodiments is to sequences of multi-input single-output data, such as records in a database, which may for example represent different features of a given symbol. Considering a database with records that arrive at consecutive times, the algorithm (when extended to multi-dimensions) may compare a sequence of records and decide whether they have similar statistical properties to the previous records in the database. The comparison can be used to detect changes in the characteristics of the source which generates the records in the database.

Likewise the database may already be in place, in which case the algorithm may compare records at different locations in the database.

It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable subcombination.

It will be appreciated by persons skilled in the art that the present invention is not limited to what has been particularly shown and described hereinabove. Rather the scope of the present invention is defined by the appended claims and includes both combinations and subcombinations of the various features described hereinabove as well as variations and modifications thereof which would occur to persons skilled in the art upon reading the foregoing description.

Claims

1. Apparatus for building a stochastic model of a data sequence, 
said data sequence comprising spatially related symbols selected from a finite 
symbol set, the apparatus comprising: 

an input for receiving said data sequence, 

a tree builder for expressing said symbols as a series of counters within 
nodes, each node having a counter for each symbol, each node having a 
position within said tree, said position expressing a symbol sequence and each 
counter indicating a number of its corresponding symbol which follows a 
symbol sequence of its respective node, and 

a tree reducer for reducing said tree to an irreducible set of conditional 
probabilities of relationships between symbols in said input data sequence. 

2. Apparatus according to claim 1, said tree reducer comprising a 
tree pruner for removing from said tree any node whose counter values are 
within a threshold distance of counter values of a preceding node in said tree. 

3. Apparatus according to claim 2, wherein said threshold distance and tree construction parameters are user selectable.

4. The apparatus of claim 3, wherein said user selectable parameters further comprise a tree maximum depth.

5. The apparatus of claim 3, wherein said user selectable parameters further comprise an algorithm buffer size.

6. The apparatus of claim 3, wherein said user selectable parameters further comprise values of at least one pruning constant.

7. The apparatus of claim 3, wherein said user selectable parameters further comprise a length of input sequences.

8. The apparatus of claim 3, wherein said user selectable parameters further comprise an order of input symbols.

9. Apparatus according to claim 2, wherein said tree reducer further comprises a path remover operable to remove any path within said tree that is a subset of another path within said tree.

10. Apparatus according to claim 1, wherein said sequential data is a string comprising consecutive symbols selected from a finite set.

11. The apparatus of claim 10, further comprising an input string permutation unit for carrying out permutations and reorganizations of the input string using external information about a process generating said string.

12. Apparatus according to claim 9, wherein said string is a nucleic 
acid sequence. 

13. The apparatus of claim 12, wherein said string is a promoter and said tree is operable to identify other promoters.

14. The apparatus of claim 12, wherein said string is a string of 
coding DNA, and said tree is operable to identify other coding strings. 

15. The apparatus of claim 12, wherein said string is a string of non-coding DNA, and said tree is operable to identify other non-coding strings.

16. The apparatus of claim 12, wherein said string is a DNA string and said tree is operable to identify poly-A terminators.

17. The apparatus of claim 12, wherein said string has a given 
property, and said tree is operable to identify other strings having said given 
property. 

18. Apparatus according to claim 9, wherein said string is an amino- 
acid sequence and the symbols comprise at least some of the 20 amino-acids. 

19. The apparatus of claim 9, wherein said string is an amino acid 
string and said tree is operable to identify at least one of primary, secondary 
and three dimensional protein structure. 

20. The apparatus of claim 18, wherein said string has a given property, and said tree is operable to identify other strings having said given property.

21. Apparatus according to claim 12, wherein said nucleic acid sequence is a promoter sequence and another nucleic acid sequence is a non-promoter sequence, wherein said stochastic modeler is operable to build models of said promoter sequence and said non-promoter sequence and said comparator is operable to compare a third nucleic acid sequence with each of said models to determine whether said third sequence is a promoter sequence or a non-promoter sequence.

22. Apparatus according to claim 12, wherein said nucleic acid sequence is a coding sequence and another nucleic acid sequence is a non-coding sequence, wherein said stochastic modeler is operable to build models of said coding sequence and said non-coding sequence and said comparator is operable to compare a third nucleic acid sequence with each of said models to determine whether said third sequence is a coding sequence or a non-coding sequence.

23. Apparatus according to claim 12, wherein said nucleic acid sequence is a repetitive sequence and another nucleic acid sequence is a non-repetitive sequence, wherein said stochastic modeler is operable to build models of said repetitive sequence and said non-repetitive sequence and said comparator is operable to compare a third nucleic acid sequence with each of said models to determine whether said third sequence is a repetitive sequence or a non-repetitive sequence.

24. Apparatus according to claim 12, wherein said nucleic acid sequence is a non-coding sequence and another nucleic acid sequence is a coding sequence, wherein said stochastic modeler is operable to build models of said non-coding sequence and said coding sequence and said comparator is operable to compare a third nucleic acid sequence with each of said models to determine whether said third sequence is a non-coding sequence or a coding sequence.

25. Apparatus according to claim 12, wherein said nucleic acid 
sequence is an exon sequence, wherein said stochastic modeler is operable to 
build a model of said exon sequence and said comparator is operable to 
compare a second nucleic acid sequence with said model to determine whether 
said second sequence is an exon sequence. 

26. Apparatus according to claim 12, wherein said nucleic acid 
sequence is an intron sequence, wherein said stochastic modeler is operable to 
build a model of said intron sequence and said comparator is operable to 
compare a second nucleic acid sequence with said model to determine whether 
said second sequence is an intron sequence. 

27. Apparatus according to claim 1, wherein said stochastic model is
refinable using further data sequences, thereby to define a structure of a 
common attribute of said data sequences. 

28. Apparatus for determining statistical consistency in spatially 
related data comprising a finite set of symbols, the apparatus comprising 

a sequence input for receiving said spatially related data, 

a stochastic modeler for producing at least one stochastic model from at 
least part of said spatially related data, 

and a comparator for comparing said stochastic model with a prestored 
model, thereby to determine whether there has been a statistical change in said
data. 

29. Apparatus according to claim 28, wherein said stochastic modeler 
comprises: 

a tree builder for expressing said symbols as a series of counters within
nodes, each node having a counter for each symbol, each node having a
position within said tree, said position expressing a symbol sequence and each
counter indicating a number of its corresponding symbol which follows a 
symbol sequence of its respective node, and 

a tree reducer for reducing said tree to an irreducible set of conditional
probabilities of relationships between symbols in said input data sequence. 
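
By way of non-limiting illustration only, the counter tree recited in claim 29 may be sketched as below. The sketch assumes a dictionary keyed by context (the symbols preceding a position, most recent last) rather than an explicit linked tree; the names build_counter_tree, to_probabilities and max_depth are hypothetical and do not appear in the specification. Each entry holds one counter per symbol, counting how often that symbol follows the context, and the counters of a node can then be normalised into conditional probabilities.

```python
from collections import defaultdict


def build_counter_tree(sequence, max_depth):
    """Return {context: {symbol: count}} for all contexts up to max_depth.

    A context is a tuple of the symbols immediately preceding a position
    (most recent symbol last). The empty tuple is the root node.
    """
    symbols = sorted(set(sequence))
    counts = defaultdict(lambda: dict.fromkeys(symbols, 0))
    for i, sym in enumerate(sequence):
        # Increment the counter for `sym` in the root and in every node
        # whose context is a suffix of the symbols seen so far.
        for depth in range(0, min(i, max_depth) + 1):
            context = tuple(sequence[i - depth:i])
            counts[context][sym] += 1
    return dict(counts)


def to_probabilities(node_counts):
    """Normalise one node's counters into conditional probabilities."""
    total = sum(node_counts.values())
    return {s: c / total for s, c in node_counts.items()}


# Illustrative input only: the four-symbol string a,b,a,b.
tree = build_counter_tree("abab", max_depth=2)
for context, node in sorted(tree.items()):
    print(context, node, to_probabilities(node))
```

In this representation the position of a node in the tree is implied by its context tuple: the parent of a node is obtained by dropping the oldest symbol of its context.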

30. Apparatus according to claim 28, said prestored model being a 
model constructed using another part of said spatially related data. 

31. Apparatus according to claim 28, said comparator comprising a 
statistical processor for determining a statistical distance between said
stochastic model and said prestored model. 

32. Apparatus according to claim 28, wherein said comparator 
comprises a statistical processor for determining a difference in statistical 
likelihood between said stochastic model and said prestored model. 

33. Apparatus according to claim 31, said statistical distance being a 
relative complexity measure. 

34. Apparatus according to claim 31, wherein said statistical distance 
comprises an SPRT statistic. 

35. Apparatus according to claim 31, wherein said statistical distance
comprises an MDL statistic. 

36. Apparatus according to claim 31, wherein said statistical distance 
comprises a Multinomial goodness of fit statistic. 

37. Apparatus according to claim 31, wherein said statistical distance 
comprises a Weinberger Statistic. 

38. Apparatus according to claim 31, wherein said statistical distance
comprises a KL statistic. 
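
As a hedged illustration of one of the statistical distances listed above, the Kullback-Leibler (KL) divergence between two discrete distributions is D(P||Q) = sum over x of P(x) log(P(x)/Q(x)). The sketch below assumes that each model is held as a dictionary mapping a context to a pair (context probability, conditional distribution) and that both models share the same contexts; these assumptions, and the names kl_divergence and model_kl, are illustrative only. The model-level function computes only the context-weighted conditional term of the KL chain rule and omits the divergence between the context distributions themselves.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL divergence D(p || q) in nats for distributions over the same symbols."""
    # Terms with p[s] == 0 contribute nothing; q[s] is floored at eps
    # to avoid division by zero.
    return sum(p[s] * math.log(p[s] / max(q[s], eps)) for s in p if p[s] > 0)


def model_kl(model_p, model_q):
    """Context-weighted sum of conditional KL divergences between two models.

    Each model maps context -> (context probability, {symbol: probability}).
    Assumes both models contain exactly the same contexts.
    """
    total = 0.0
    for context, (weight, cond_p) in model_p.items():
        _, cond_q = model_q[context]
        total += weight * kl_divergence(cond_p, cond_q)
    return total


# Illustrative numbers only, not taken from the specification.
monitored = {(): (1.0, {"a": 0.5, "b": 0.5})}
reference = {(): (1.0, {"a": 0.25, "b": 0.75})}
print(model_kl(monitored, reference))
```

A value of zero indicates identical conditional distributions; larger values indicate a larger statistical change between the monitored model and the prestored model.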

39. Apparatus according to claim 29, said tree reducer comprising a 
tree pruner for removing from said tree any node whose counter values are 
within a threshold distance of counter values of a preceding node in said tree. 

40. Apparatus according to claim 39, wherein said threshold distance 
is user selectable. 
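
The pruning recited in claims 39 and 40 may be pictured with the following minimal sketch. It assumes the tree is a dictionary mapping a context tuple to per-symbol counters, that the "preceding node" is the parent obtained by dropping the oldest symbol of the context, and that the distance is a count-weighted log-ratio between the node's and the parent's conditional probabilities. The distance measure, the single-pass strategy and the name prune are assumptions made for illustration and are not asserted to be the measure used in the specification.

```python
import math


def prune(tree, threshold):
    """Keep the root and every node whose counters differ enough from its parent's.

    tree maps context tuple -> {symbol: count}; threshold is user selectable.
    """
    kept = {}
    for context, counts in tree.items():
        if not context:
            kept[context] = counts  # the root is never pruned
            continue
        parent = tree[context[1:]]  # drop the oldest symbol of the context
        n, n_parent = sum(counts.values()), sum(parent.values())
        # Count-weighted log-ratio of the node's probabilities to the parent's.
        distance = sum(
            c * math.log((c / n) / (parent[s] / n_parent))
            for s, c in counts.items()
            if c > 0
        )
        if distance >= threshold:
            kept[context] = counts
    return kept


# Hypothetical counters for the string a,b,a,b with contexts up to length 2.
example_tree = {
    (): {"a": 2, "b": 2},
    ("a",): {"a": 0, "b": 2},
    ("b",): {"a": 1, "b": 0},
    ("a", "b"): {"a": 1, "b": 0},
    ("b", "a"): {"a": 0, "b": 1},
}
print(sorted(prune(example_tree, threshold=0.5)))
```

With these counters the two length-2 contexts predict exactly what their parents predict, so they are removed, while the length-1 contexts are retained. Raising the threshold prunes more aggressively.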

41. The apparatus of claim 40, wherein user selectable parameters
further comprise a tree maximum depth. 

42. The apparatus of claim 40, wherein user selectable parameters 
further comprise an algorithm buffer size.

43. The apparatus of claim 40, wherein user selectable parameters 
further comprise a value of at least one pruning constant. 

44. The apparatus of claim 40, wherein user selectable parameters 
further comprise a length of input sequences. 

45. The apparatus of claim, wherein user selectable parameters 
further comprise an order of input symbols. 

46. Apparatus according to claim 39, wherein said tree reducer 
further comprises a path remover operable to remove any path within said tree 
that is a subset of another path within said tree. 

47. Apparatus according to claim 28, wherein said data comprises a
nucleic acid sequence. 

48. Apparatus according to claim 28, wherein said data comprises an 
amino-acid sequence. 

49. Apparatus according to claim 28, wherein said sequential data is 
an output of a medical sensor sensing bodily functions. 

50. Apparatus according to claim 47, wherein said nucleic acid 
sequence is a promoter sequence and another nucleic acid sequence is a non- 
promoter sequence, wherein said stochastic modeler is operable to build 
models of said promoter sequence and said non-promoter sequence and said 
comparator is operable to compare a third nucleic acid sequence with each of 
said models to determine whether said third sequence is a promoter sequence or 
a non-promoter sequence. 

51. Apparatus according to claim 47, wherein said nucleic acid
sequence is a coding sequence and another nucleic acid sequence is a non- 
coding sequence, wherein said stochastic modeler is operable to build models 
of said coding sequence and said non-coding sequence and said comparator is 
operable to compare a third nucleic acid sequence with each of said models to
determine whether said third sequence is a coding sequence or a non-coding 
sequence. 

52. Apparatus according to claim 47, wherein said nucleic acid 
sequence is a repetitive sequence and another nucleic acid sequence is a non- 
repetitive sequence, wherein said stochastic modeler is operable to build 
models of said repetitive sequence and said non-repetitive sequence and said 
comparator is operable to compare a third nucleic acid sequence with each of 
said models to determine whether said third sequence is a repetitive sequence 
or a non-repetitive sequence. 

53. Apparatus according to claim 47, wherein said nucleic acid 
sequence is a non-coding sequence and another nucleic acid sequence is a non- 
non-coding sequence, wherein said stochastic modeler is operable to build 
models of said non-coding sequence and said non-non-coding sequence and 
said comparator is operable to compare a third nucleic acid sequence with each 
of said models to determine whether said third sequence is a non-coding 
sequence or a non-non-coding sequence. 

54. Apparatus according to claim 31, wherein said data sequence 
comprises image data of a first image.

55. Apparatus according to claim 54, said distance being indicative
of a statistical distribution within said image. 

56. Apparatus according to claim 55, further comprising an image 
comparator for comparing said statistical distribution with a statistical 
distribution of another image. 

57. Apparatus according to claim 56, said other image being of a 
same view as said first image taken at a different time, said distance being 
indicative of time dependent change. 

58. Apparatus according to claim 54, said image data comprising 
medical imaging data, said statistical distance being indicative of deviations of 
said data from an expected norm. 

59. Apparatus according to claim 31, applicable to a database to 
perform data mining on said database. 

60. Apparatus according to claim 28, said stochastic model being
constructed from descriptions of a plurality of enzymes for carrying out a given 
task, said model thereby providing a generic structural description of an 
enzyme for carrying out said task. 

61. Apparatus according to claim 28, said model being usable to 
analyze results of a nucleic acid microarray.

62. Apparatus according to claim 28, said model being usable to 
analyze results of a protein microarray. 

63. A method of designing a protein for carrying out a predetermined
task, the method comprising: 

taking a plurality of proteins known to carry out said predetermined
task,

constructing a stochastic model using an amino acid sequence of said 
plurality of proteins, 

using said stochastic model to predict a protein sequence. 

64. A method of designing a protein for carrying out a predetermined
task, the method comprising: 

taking a plurality of proteins known to carry out said predetermined 
task, 

constructing a stochastic model using the 3D structure of said plurality 
of proteins, 

using said stochastic model to determine a protein structure. 

65. A method of distinguishing between biological sequences of a 
first kind and biological sequences of a second kind, each kind being 
expressible in terms of a same finite set of symbols, the method comprising:

obtaining a statistically significant set of sequences of said first kind and
building a stochastic model thereof, 

obtaining a statistically significant set of sequences of said second kind 
and building a stochastic model thereof, and 

taking a further sequence and comparing it with each stochastic model 
to determine whether it belongs to either set.
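
Purely as an illustrative sketch of the method of claim 65, two models can be trained on example sequences of each kind and a further sequence assigned to whichever model gives it the higher log-likelihood. For brevity the sketch substitutes a fixed-order Markov model with add-one smoothing for the stochastic model of the specification; the function names, the toy sequences and the choice of order 1 are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def train(sequences, order, alphabet):
    """Count, for each context of length `order`, how often each symbol follows it."""
    counts = defaultdict(lambda: dict.fromkeys(alphabet, 0))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    return dict(counts)


def log_likelihood(seq, model, order, alphabet):
    """Log-likelihood of `seq` under `model`, with add-one (Laplace) smoothing."""
    total = 0.0
    for i in range(order, len(seq)):
        ctx_counts = model.get(seq[i - order:i], {})
        n = sum(ctx_counts.values())
        # Smoothing keeps unseen contexts and transitions at non-zero probability.
        total += math.log((ctx_counts.get(seq[i], 0) + 1) / (n + len(alphabet)))
    return total


def classify(seq, model_first, model_second, order, alphabet):
    """Return "first" or "second"; ties are assigned to "second"."""
    first = log_likelihood(seq, model_first, order, alphabet)
    second = log_likelihood(seq, model_second, order, alphabet)
    return "first" if first > second else "second"


# Toy training data, invented for illustration only.
ALPHABET = "ACGT"
ORDER = 1
model_first = train(["ATATATAT", "TATATATA"], ORDER, ALPHABET)
model_second = train(["GCGCGCGC", "CGCGCGCG"], ORDER, ALPHABET)
print(classify("ATATAT", model_first, model_second, ORDER, ALPHABET))
```

In practice the training sets would need to be statistically significant, as the claim requires, and a threshold on the likelihood difference could be added so that a sequence fitting neither model is assigned to neither set.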

66. The method of claim 65, wherein said biological sequences are 
nucleic acid sequences. 

67. The method of claim 65, wherein said biological sequences are
amino acid sequences. 



68. The method of claim 66, wherein the sequences of said first kind 
are promoter sequences. 



69. The method of claim 66, wherein the sequences of said first kind 
are coding and the sequences of said second kind are non-encoding sequences. 



70. The method of claim 69, wherein the sequences are non-species 
specific, thereby constructing models which are non-species specific. 

The counter context tree constructed from x^4 = a,b,a,b;
partial application of step 1.2

The pruned counter context-tree of the string
(a,b,a,b,c,a,b,a,b,c,a,b,a,b,a,b,c) replicated 8 times 

Fig. 11



The context tree containing vectors of conditional probabilities 
P(x|s) as obtained from the counter context-tree in figure A4.
Optimal contexts are represented by the bolded frame 

Shifts in the process underlying normal standard deviation

-X=1, 1.5, 2 (number of runs for each process properties is equal to 50)

Shift in the process underlying normal standard deviation
-^=0.5 (number of runs equal to 50) 

The SCC control chart for "in-control" data

Shewhart SPC X chart - "in-control" data (solid line) and "out-of-control"
data (thin line) 

Fig. 21

Shewhart SPC S chart - "in-control" data (solid line) and "out-of-control"
data (dashed line) 

Fig. 22

Shewhart SPC X chart for N=125 sample size -"in-control" data 
(solid line) and "out-of-control" data (dashed line)

(57) Abstract: Apparatus for building a stochastic model of a spatially
related data sequence, the data sequence comprising symbols selected from a
finite symbol set, the apparatus comprising an input for receiving the data
sequence, a tree builder (42) for expressing said symbols as a series of
counters (46) within nodes, each node having a counter for each symbol, each
node having a position within the tree, the position expressing a symbol
sequence and each counter indicating a number of its corresponding symbol
which follows a symbol sequence of its respective node, and a tree reducer
for reducing the tree to an irreducible set of conditional probabilities of
relationships between symbols in the input data sequence.



